SECTION I


INTRODUCTION  AND  OVERVIEW 


Much research has been done in Government, Business, and Industry
to obtain the capability to predict future occurrences (events).
The need for such a capability is demonstrated by the time, money
and, in some cases, lives that can be saved using these predictions.
For instance, studies are made to predict such things as the number
of red blood cells in a blood sample based upon the packed-cell
volume of the blood, the average grade students make on a
standardized test based upon their I.Q.s, the death rate of males
exposed to the environmental conditions of a coal mine for over a
10-year period, the cost of a piece of avionics equipment based upon
its physical characteristics, etc.

Past experience is usually the only means of predicting the future.
The approaches to prediction can take forms ranging from hard
objective evidence (which is rarely available) to pure speculation.
As an example of hard objective evidence, if you drop an apple from
a cliff, most people will agree with the prediction that the apple
will hit the ground. Accounting-type models are useful estimating
tools but require a large amount of detailed information. Estimating
the cost of a piece of equipment early in the conceptual/preliminary
design phase, however, does not usually have the luxury of hard
objective evidence or such detailed information on which to base
decisions. Other approaches to prediction, such as the subjective
approach, rely on the opinions of qualified experts in the field of
study. And then, of course, there is the "crystal ball" approach.

Once an estimate is made, however, there is an obvious question:
"How accurate is the estimate?" Because of certain constraints
(such as time, money and scope of the study) some approaches to
estimation are the only ones possible, but they have a major drawback
in that the merits of future predictions usually cannot be quantified.
Mathematicians and statisticians have developed (and are still
developing) many techniques for estimating purposes with particular
emphasis on quantifying the reliability of estimates. A major
area of statistics that has been used for over a century is that
of Regression Analysis. This is the approach to estimation we
take.

Inherent in the interpretation of the words prediction or estimate
is the term uncertainty. It would be nice to make "exact" predictions,
but this is rarely possible when dealing with a mass of statistical
data. Thus, statisticians do not profess to estimate exactly, but
only that their predictions are "on the average" reasonably close.
The basic concept of Regression Analysis is then to estimate the
average value of a given variable (called the dependent variable)
in terms of the known values of one or more other variables (called
independent variables). Regression Analysis expresses the
relationships of these variables by determining the form of a
mathematical equation connecting them. In other words, there are
three major questions that are asked in Regression Analysis:

(1) Is there a relationship between the dependent and the
independent variables?

(2) If there is a relationship, how can it be "best"
expressed in the form of a mathematical equation?

(3) What statistics, plots, techniques, etc., can be used to
verify the accuracy of the equation obtained?

For instance, if a study is made to estimate the average weight of
a female in a given university based upon her height, the procedure
would be to select a "representative sample" of the females in the
university, record both their heights and weights (the data) and try
to fit the "best" mathematical relationship that connects weight to
height. In many cases, as in estimating equipment costs, one
independent variable does not provide enough information to predict
the dependent variable accurately. Considering additional independent
variables can, in most cases, lead to more accurate estimates,
since more information should lead to better predictions.

The purpose of this study is to estimate the Operations and
Maintenance (O&M) cost of avionics equipment early in the
conceptual/preliminary design phase, based upon the physical
characteristics of the equipment in addition to any current
information available (such as the type of aircraft in which the
equipment is used and the equipment's avionics area). The tool used
to estimate avionics O&M costs is a computer model developed by the
Logistics Engineering Section of Westinghouse for the Air Force
Avionics Laboratory (AFAL), called the Avionics Laboratory
Predictive Operations and Support (ALPOS) Model. The ALPOS model is
highly dependent on the six estimating relationships obtained for
the logistics, support and cost parameters: Maintenance Manhours per
Operating Hour (MMH/OH); Mean Time Between Failure (MTBF); Mean Time
Between Maintenance Actions (MTBMA); Logistic Support Costs per
Operating Hour (LSC/OH); Training Cost per Operating Hour (TRAIN/OH);
and the fraction Not Repairable this Station (NRTS). The approach was
to collect data consisting of 21 independent variables covering a
wide spectrum of avionics equipment, and to develop, by means of
Multiple Regression Analyses, Cost-Estimating Relationships (CERs),
i.e., Regression equations where the dependent variable is cost
(LSC/OH, TRAIN/OH), and Parametric Estimating Relationships (PERs),
Regression equations where the dependent variable is a parameter
which drives cost (MTBF, MTBMA, MMH/OH, NRTS). Other parameters which
drive O&M cost, such as spares cost and support equipment cost, are
not estimated using regressions, since there are many other
subjective variables affecting these parameters that cannot easily be
quantified. The relationships obtained for MTBMA and NRTS, however,
are used in conjunction with an Expected Back Order (EBO) criterion
to estimate the quantity of spares and hence spares cost. The
interested reader is referred to Vol. I of this report for a look at
"The Design of the Experiment" and the development of the ALPOS
model, in addition to the approaches used to estimate spares costs
and support equipment costs. This volume is mainly devoted to the
Multiple Regression Analysis techniques used to obtain the estimating
relationships for MMH/OH, MTBF, MTBMA, LSC/OH, TRAIN/OH and NRTS.

Since the six parameters considered are major drivers of Operations
and Maintenance Cost, much emphasis has been placed on finding the
most up-to-date approach to the subject of Regression Analysis.
The main reference noted throughout this report is a book written in
1971 by C. Daniel and F. S. Wood entitled Fitting Equations to Data [1],
out of which evolved a most powerful computer program called "The Linear
Least-Squares Curve-Fitting Program" (LLSCFP). As will be seen,
the sophistication of the approach and techniques used in [1]
is far beyond that of any standard statistics book and of many
advanced textbooks on Regression Analysis. The innovative use of
"interior" statistics, "indicator" variables and computerized plots
is extremely helpful in leading a qualified statistician in the
direction of obtaining the "best" estimating relationships that can
be obtained from a given set of multifactor data.

The proposals presented in [1] have been successfully discussed in
seminars at many distinguished universities worldwide as well as
at the Bell Telephone Laboratories and the National Cancer Institute.
The LLSCFP has also been the most sought-after program in both the
SHARE (IBM) and VIM (CDC) libraries of computer programs, and has
been converted to run in East Germany and Russia. These techniques
have been applied in a wide range of areas including studies by
government agencies of variables for pollution control, searches for
influential variables which cause cancer, studies to estimate
hospital costs, studies in the conservation of energy and the
evaluation of moon rocks at the Johnson Space Center. In addition, a
Bureau of Labor Statistics study has shown that the coefficients
estimated by the LLSCFP are accurate to 15 digits.

It is felt then that the proposals and techniques presented in [1]
are the "state of the art" in Regression Analysis.

1 Fitting Equations to Data, Computer Analysis of Multifactor Data for
Scientists and Engineers, C. Daniel and F. S. Wood with the
assistance of J. W. Gorman, Wiley, (1971).

A word of caution, however, is in order: the LLSCFP is not
idiot-proof, and the cost analyst must remember that Regression
Analysis is highly dependent on the "goodness" of the data and maybe
to a greater extent on the assumed functional form of the equation.
If the assumed functional form is incorrect, then the statistics will
be misleading, giving the wrong values of the coefficients to be
estimated, making uninfluential variables seem influential and
possibly even dropping the most influential variables. Many examples
in Regression Analysis have an assumed form of the estimating
relationship based on a previous study or on technical knowledge of
the process studied. However, there has been no previous study
devoted to developing CERs for avionics equipment in as many as
21 independent variables, nor is enough known about avionics
equipment to supply technical knowledge of the correct functional
form. We must not stop here, but should simultaneously consider
all variables which are "assumed" to have an influential effect on
the dependent variable, and let statistics and techniques lead
the analyst in the direction of obtaining the equations that
yield the best possible predictions. Many functional forms
can be quite complicated, and for a given range of interest,
transformations (such as the square, the natural logarithm, the
exponential, the square root, etc.) are often used to approximate
them. To assist in obtaining the "best" possible equations, three
forms or transformations of the independent variables (namely the
variable, its square and its natural logarithm) and two forms of
the dependent variable (the dependent variable and its natural
logarithm) are used in this report.
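
As a rough illustration of the candidate forms just listed, the sketch
below builds the three forms of an independent variable and the two
forms of the dependent variable for a single observation. The function
names and the sample values are hypothetical; they are not taken from
the LLSCFP or from the study data.

```python
from math import log

def independent_forms(x):
    """Return the three candidate forms of a positive independent variable."""
    return {"x": x, "x^2": x * x, "ln x": log(x)}

def dependent_forms(y):
    """Return the two candidate forms of a positive dependent variable."""
    return {"y": y, "ln y": log(y)}

# Hypothetical observation (illustrative values only).
forms_x = independent_forms(4.0)
forms_y = dependent_forms(1.0)
```

Each candidate form then enters the regression as an ordinary column
of data, so the fitted equation remains linear in its coefficients.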

It is to be emphasized that the independent variables are not
considered one at a time or in pairs or in any other grouping, but
that they have all been considered simultaneously to determine their
compound effect on the parameters to be estimated. It can easily be
shown that a dependent variable can be highly correlated with one
variable and show no apparent correlation with another variable,
while the compound effect of both variables (or many variables) has
a significant effect on the dependent variable. Hence scatter
diagrams of the dependent variable versus each independent variable
should not be used in determining the form of the equation when
multiple variables are considered.

Thus, in this study such complicated functional forms as

$$ y = b_0 + b_1 x_1 + b_2 x_2^2 + b_3 \ln x_3 + \cdots $$

and

$$ y = b_0 e^{b_1 x_1} e^{b_2 x_2} \cdots $$

are considered as means of estimating advanced equipment costs. Here
y stands for the dependent variable, the $x_i$'s for the independent
variables, the $b_i$'s for the constants, and e for the exponential
function.
To assist in verifying the accuracy of the equations obtained,
there are over thirty statistics, five types of plots, several
techniques and different tabular arrangements of the data that
are available in the computer printouts of the LLSCFP. This
document includes a brief discussion of the concepts of Regression
Analysis including the statistics, plots and techniques used to
estimate advanced equipment costs. Also given, as an example,
are the procedures and approaches utilized to obtain the
parametric estimating relationship for the support parameter
Mean Time Between Maintenance Actions (MTBMA).


SECTION  II 




THE  METHOD  OF  LEAST-SQUARES 

The form of the equations considered throughout this report can be
written (or transformed) into the linear equation in (k + 1)
unknowns

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \tag{1} $$

where y is the dependent variable, $x_1, \ldots, x_k$ are the k
independent variables, and $\beta_0$ (the constant) and the k
coefficients $\beta_1, \ldots, \beta_k$ make up the unknown (k + 1)
population parameters. Also it is assumed that there are N
observations (pieces of equipment) in the sample, indexed by j. Thus
$y_j$ represents the (observed) jth observation of the dependent
variable and $x_{ij}$ the jth observation of the ith independent
variable. Regression Analysis requires that the analyst find
statistics $b_0, b_1, \ldots, b_k$ which "best" approximate the
unknown (k + 1) population parameters (where we have taken a sample
from the population of all avionics equipment), and whose fitted
equation

$$ Y = b_0 + b_1 x_1 + \cdots + b_k x_k \tag{2} $$

gives the "best" possible prediction. The method most widely used
by statisticians to accomplish this is called the method of
least-squares, which says:

"Find the values of the constants in the assumed
equation that minimize the sum of the squared deviations
of the observed values from those estimated by the
equation."

In other words, minimize

$$ Q = \sum_{j=1}^{N} (y_j - Y_j)^2 , $$

where $Y_j$ is the estimate of the jth observation of the dependent
variable obtained by (2). Once the estimates $b_0, b_1, \ldots, b_k$
are found, substituting the values of the independent variables in
(2) yields the estimate of the dependent variable Y. We thus find
ourselves in an area of statistics called "Inductive Statistics"
which uses the concepts of "Statistical Inference" to make
generalizations (or estimates) of population parameters based upon a
given sample of the population, and to quantify the reliability of
the estimates obtained. In order to make these generalizations,
however, the data must satisfy certain assumptions.




SECTION  III 

ASSUMPTIONS  OF  THE  METHOD  OF  LEAST-SQUARES 


There are four major assumptions which the data must satisfy in
order to use the techniques of least-squares estimation. They are:

A1. The data is "good" data.

A2. The correct form of the equation has been chosen,
    e.g., $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.

A3. The independent variables are constant, non-random
    variables, measured without error.

A4. All error is in the observations of the dependent
    variable $y_j$, i.e.,

    $$ y_j = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} + e_j $$

    where $e_j$ represents random error. Moreover, the $e_j$ are
    normally distributed independent random variables with
    mean zero and constant, though unknown, variance $\sigma^2(y)$.

If all four of the above assumptions hold or "approximately" hold,
then the least-squares approach will give the best estimates of
the coefficients in the relationships. Past experience, however,
indicates that slight departure from the assumptions of normality
and equal variances has little effect on the results.

Since these assumptions are the basis for the method of least-squares
estimation and hence Regression Analysis, much emphasis must be
placed on determining how closely the data fit the assumptions.


SECTION  IV 


ONE  INDEPENDENT  VARIABLE 


Very often in practice a relationship connecting two variables
(one independent and one dependent) is desired. The equation
most widely used is the linear equation (in two unknowns
$\beta_0$ and $\beta_1$),

$$ y = \beta_0 + \beta_1 x_1 \tag{3} $$

If all pairs of values of $x_1$ and y, when plotted in a scatter diagram
on ordinary graph paper, fall on or near a straight line, equation (3)
is the correct form of the relationship to be used. According to
the least-squares criterion, we must use the data to calculate
statistics $b_0$ and $b_1$ which estimate the parameters $\beta_0$ and
$\beta_1$, and whose fitted equation can be expressed by

$$ Y = b_0 + b_1 x_1 \tag{4} $$

In addition, the statistics $b_0$ and $b_1$ must be chosen so as to
minimize Q where:

$$ Q = \sum_{j=1}^{N} (y_j - Y_j)^2 = \sum_{j=1}^{N} (y_j - b_0 - b_1 x_{1j})^2 \tag{5} $$

By the techniques of differential calculus, the way to find $b_0$ and
$b_1$ which minimize Q is to take partial derivatives of Q with respect
to both $b_0$ and $b_1$, set the results equal to 0 and solve the
equations for $b_0$ and $b_1$. Thus, taking partial derivatives we obtain

$$ \sum_{j=1}^{N} (y_j - b_0 - b_1 x_{1j}) = 0 $$

and

$$ \sum_{j=1}^{N} (y_j - b_0 - b_1 x_{1j})(x_{1j}) = 0 \tag{6} $$

Solving the first equation of (6) for $b_0$ and substituting the result
into the second equation yields the following linear least-squares
estimates:

$$ b_0 = \bar{y} - b_1 \bar{x}_1 $$

and

$$ b_1 = \frac{\sum_{j=1}^{N} (x_{1j} - \bar{x}_1)(y_j - \bar{y})}{\sum_{j=1}^{N} (x_{1j} - \bar{x}_1)^2} \tag{7} $$

where $\bar{y}$ and $\bar{x}_1$ represent the arithmetic means of the
dependent and independent variables respectively.
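
The least-squares estimates for the straight line can be sketched
directly from the formulas for $b_0$ and $b_1$. The function name
`fit_line` and the five data pairs are hypothetical, chosen only to
make the arithmetic easy to follow.

```python
def fit_line(x, y):
    """Least-squares constant b0 and slope b1 for the line Y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical sample of N = 5 observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
b0, b1 = fit_line(x, y)

# The residuals satisfy the two conditions obtained from the derivatives.
residuals = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]
sum_residuals = sum(residuals)
sum_residuals_times_x = sum(r * xj for r, xj in zip(residuals, x))
```

For these values the slope is about 1.97 and the constant about 0.09,
and both sums of residuals vanish to within rounding error.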

Scatter diagrams, however, do not always give an indication of a linear
relationship, but may show some evidence of curvature. For instance, the
graph might indicate that the form of the equation is a parabola, i.e.,

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 \tag{8} $$

If a linear equation is fitted to data that are better represented
by a curvilinear equation, then assumption A2 is violated and the
results can be spurious.

The use of a special type of graph paper (on which either one or
both of the variables are calibrated logarithmically), called
semi-log or log-log graph paper, is also helpful in choosing other
forms of the relationship.

If in a scatter diagram plotted on semi-log paper the observations
fall near a straight line, then the exponential curve,

$$ y = \beta_0 e^{\beta_1 x_1} , \tag{9} $$

is the appropriate choice. If a straight line is obtained on log-log
paper, the geometric curve,

$$ y = \beta_0 x_1^{\beta_1} , \tag{10} $$

is appropriate. Taking the natural logarithm of both sides of (9)
and (10) yields

$$ \ln y = \ln \beta_0 + \beta_1 x_1 $$

and

$$ \ln y = \ln \beta_0 + \beta_1 \ln x_1 \tag{11} $$

respectively. Using the simple transformation $y' = \ln y$,
$x_1' = \ln x_1$ and $\beta_0' = \ln \beta_0$, equation (11) is
transformed into

$$ y' = \beta_0' + \beta_1 x_1 $$

and

$$ y' = \beta_0' + \beta_1 x_1' \tag{12} $$

which are both similar in form to equation (3), and whose coefficients
can be estimated by (7). Thus, any functional forms (e.g., (8) or (9))
which can be linearized by simple transformations fall under the realm
of least-squares estimation techniques. Many other plots of linearizable
equations (see [1] and [2]) are also useful in finding the correct
functional relationship to be used when only one independent variable is
considered to be influential.
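
The linearizing transformations can be sketched as follows: the
exponential curve is fitted by regressing ln y on x, and the geometric
curve by regressing ln y on ln x. The function names are hypothetical,
and the data are generated exactly from assumed curves so that the
recovered constants can be checked.

```python
from math import exp, log

def fit_line(x, y):
    """Least-squares constant and slope for a straight line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def fit_exponential(x, y):
    """Fit y = b0 * exp(b1 * x) by regressing ln y on x."""
    ln_b0, b1 = fit_line(x, [log(v) for v in y])
    return exp(ln_b0), b1

def fit_geometric(x, y):
    """Fit y = b0 * x**b1 by regressing ln y on ln x."""
    ln_b0, b1 = fit_line([log(v) for v in x], [log(v) for v in y])
    return exp(ln_b0), b1

# Hypothetical data generated from y = 2 exp(0.5 x) and y = 3 x^1.5.
x_exp = [0.0, 1.0, 2.0, 3.0]
y_exp = [2.0 * exp(0.5 * v) for v in x_exp]
x_geo = [1.0, 2.0, 4.0, 8.0]
y_geo = [3.0 * v ** 1.5 for v in x_geo]

exp_b0, exp_b1 = fit_exponential(x_exp, y_exp)
geo_b0, geo_b1 = fit_geometric(x_geo, y_geo)
```

Note that the constant is recovered by exponentiating the fitted
intercept, since the regression estimates ln b0 rather than b0 itself.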


2 "Fitting Curves to Data," A. E. Hoerl, Chemical Business Handbook
(edited by J. H. Perry), McGraw-Hill, 1954, Section 20, pp. 55-77.


SECTION V

MULTIPLE  REGRESSION  ANALYSIS 

When two or more independent variables are considered in a regression
exercise, scatter diagrams and other graphical methods are often
useless when trying to determine the form of the assumed equation.
For instance, if the two independent variables $x_1$ and $x_2$ are
considered, an $x_1$ - y scatter diagram might indicate a strong linear
relationship whereas the $x_2$ - y and $x_1$ - $x_2$ scatter diagrams
may show no apparent correlation, even though the true form of the
equation is

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 . $$

Therefore, graphical techniques are not considered as an alternative
for finding the correct form of the equation to be tested (as required
by assumption A2). If the correct form is not known, the analyst should
try several forms of the equation, and let the statistics verify
the correct form.
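
A small constructed example shows how pairwise correlations can hide
a variable that belongs in the equation. The eight observations below
are hypothetical and were generated exactly from y = 5 + x1 + x2; the
helper names are likewise illustrative. The two-variable system is
solved here with determinants (Cramer's rule) on the deviations from
the means.

```python
from math import sqrt

def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def correlation(a, b):
    ac, bc = centered(a), centered(b)
    return dot(ac, bc) / sqrt(dot(ac, ac) * dot(bc, bc))

# Hypothetical data built so that y = 5 + x1 + x2 holds exactly.
x1 = [230.0, 230.0, 170.0, 170.0, 210.0, 210.0, 190.0, 190.0]
x2 = [11.0, 9.0, 11.0, 9.0, 12.0, 8.0, 12.0, 8.0]
y = [246.0, 244.0, 186.0, 184.0, 227.0, 223.0, 207.0, 203.0]

r_y_x1 = correlation(y, x1)    # strong
r_y_x2 = correlation(y, x2)    # close to zero
r_x1_x2 = correlation(x1, x2)  # zero

# Fit both variables simultaneously using the centered normal equations.
x1c, x2c, yc = centered(x1), centered(x2), centered(y)
s11, s22, s12 = dot(x1c, x1c), dot(x2c, x2c), dot(x1c, x2c)
s1y, s2y = dot(x1c, yc), dot(x2c, yc)
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
n = len(y)
b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
```

The x2 - y correlation is only about 0.07, yet the simultaneous fit
recovers a coefficient of 1 for x2, the same as for x1.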

THE "GLOBAL" STATISTICS

As  stated  previously,  in  addition  to  estimating  the  coefficients,  a 
means  is  needed  to  determine  how  "good"  these  estimates  are.  The 
statistics  used  to  verify  the  "goodness  of  fit"  of  the  relationships 
will  be  briefly  defined  with  some  general  comments.  The  capital 
letters  correspond  to  the  respective  names  of  these  statistics  as 
listed  in  the  computer  printouts  of  the  LLSCFP. 

We begin with a few elementary statistics which are quite
helpful. They are the sums, means, maximums, minimums, ranges and
standard deviations of the variables (both independent and
dependent). With these statistics, the analyst can get a good
indication of the distributional properties of the variables.

SUMS  OF  VARIABLES 

The LLSCFP lists the sum, $\sum_{j=1}^{N} x_{ij}$, for each independent
variable and $\sum_{j=1}^{N} y_j$ for the dependent variable.

MEANS OF VARIABLES

The arithmetic means of each independent variable and of the
dependent variable, denoted by

$$ \bar{x}_i = \frac{\sum_{j=1}^{N} x_{ij}}{N} $$

and

$$ \bar{y} = \frac{\sum_{j=1}^{N} y_j}{N} $$

are listed under this heading.


ROOT  MEAN  SQUARES  OF  VARIABLES 

The root mean square of a variable (also called the standard
deviation) is a statistic that can give an indication of the spread
or variation of each variable, independent and dependent, in the
data and is denoted by

$$ S_{x_i} = \sqrt{\frac{\sum_{j=1}^{N} (x_{ij} - \bar{x}_i)^2}{N - 1}} $$

and

$$ S_y = \sqrt{\frac{\sum_{j=1}^{N} (y_j - \bar{y})^2}{N - 1}} $$

respectively.

MAX X(I)

The maximum value of the ith independent variable.

MIN X(I)

The minimum value of the ith independent variable.

RANGE X(I)

The range of the ith independent variable, i.e., the maximum value
minus the minimum value.

MAX Y

The maximum value of the dependent variable.

MIN Y

The minimum value of the dependent variable.


RANGE  Y 

The  range  of  the  dependent  variable. 

Many  times  these  elementary  statistics  can  give  a quick  indication 
that  something  is  wrong  with  the  data  (e.g.,  an  impossible  maximum 
value) . Recall  Assumption  A1  specifies  that  the  data  is  "good"  data. 
The  simple  statistics  can  be  helpful  in  pinpointing  which  particular 
variables  are  causing  such  things  as  outliers  (impossible  values) 
to  develop  in  the  results. 
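
The elementary statistics described above can be sketched for a
single variable as follows. The function name and the eight sample
values are hypothetical; the root mean square uses the N - 1 divisor
given in the definition.

```python
from math import sqrt

def elementary_statistics(values):
    """Sum, mean, root mean square, maximum, minimum and range of a variable."""
    n = len(values)
    total = sum(values)
    mean = total / n
    rms = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return {
        "sum": total,
        "mean": mean,
        "root mean square": rms,
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
    }

# Hypothetical observations of one variable.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = elementary_statistics(sample)
```

A maximum or range far outside what is physically possible for the
variable is a quick signal that an observation should be rechecked.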


COEFFICIENT  B (I) 


The least-squares estimates of a multiple regression equation are
obtained in a similar fashion to those estimated for the linear
equation in one independent variable. The partial derivative of Q
with respect to the constant $b_0$ is taken, set to zero and solved
for $b_0$. This result is then substituted into Q, and partial
derivatives of Q with respect to each of the k coefficients are taken
and set to zero, thereby yielding a k x k system of equations in k
unknowns. This system of equations is then solved by the method of
determinants to determine the desired estimates of the coefficients.
The linear least-squares estimates are:

$$ b_0 = \bar{y} - \sum_{i=1}^{k} b_i \bar{x}_i $$

and

$$ b_i = \sum_{j=1}^{N} \{ c_{i1}(x_{1j} - \bar{x}_1) + c_{i2}(x_{2j} - \bar{x}_2) + \cdots + c_{ik}(x_{kj} - \bar{x}_k) \} (y_j - \bar{y}) $$

where $c_{ij}$ is the element of the inverse matrix (obtained by the
method of determinants) belonging to the ith row and the jth column.
The LLSCFP lists the constant $b_0$ and a coefficient for each
independent variable. The coefficient $b_i$ can be used to determine
the influence of variable $x_i$ on the fitted equation.
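
The computation of the coefficients can be sketched for a small
sample. This is only an illustration under stated assumptions: the
inverse matrix is taken to be the inverse of the k x k matrix of sums
of squares and cross-products of the deviations from the means, it is
obtained here by Gauss-Jordan elimination rather than by determinants,
and the function names and data are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_multiple(columns, y):
    """columns[i] holds the N observations of the ith independent variable."""
    n, k = len(y), len(columns)
    y_bar = sum(y) / n
    x_bars = [sum(col) / n for col in columns]
    dev = [[v - m for v in col] for col, m in zip(columns, x_bars)]
    # k x k matrix of sums of squares and cross-products of the deviations.
    sscp = [[sum(a * b for a, b in zip(dev[i], dev[m])) for m in range(k)]
            for i in range(k)]
    c = invert(sscp)
    # b_i = sum_j { c_i1 (x_1j - mean) + ... + c_ik (x_kj - mean) } (y_j - mean)
    b = [sum(sum(c[i][m] * dev[m][j] for m in range(k)) * (y[j] - y_bar)
             for j in range(n)) for i in range(k)]
    b0 = y_bar - sum(bi * xb for bi, xb in zip(b, x_bars))
    return b0, b, c

# Hypothetical data generated exactly from y = 2 + 0.5 x1 + 3 x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [8.5, 6.0, 15.5, 13.0, 22.5]
b0, b, c = fit_multiple([x1, x2], y)
```

The returned matrix c is kept because its diagonal elements are needed
later for the standard errors of the coefficients.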


RESIDUAL 

The residual is defined as the difference between the observed value
of the dependent variable and the value estimated by the prediction
equation, i.e., $y_j - Y_j$. Note that this simple statistic is the
basis for least-squares analysis. Once the prediction equation is
obtained, the residuals show how well the equation estimates the
dependent variable for each observation (piece of equipment) in the
data base.

FITTED Y

The statistic $Y_j$ is called the fitted y-value of the jth observation
of the dependent variable. This is the value of the dependent variable
estimated by the prediction equation for each observation
in the data base.

TOTAL  SUM  OF  SQUARES 

The total sum of squares is an initial step in trying to get a grip
on the error of prediction. It is a measure of the total variation
in the dependent variable. It is proportional to $S_y^2$, the variance
of y, and is defined:

$$ \text{TOTAL SUM OF SQUARES} = \sum_{j=1}^{N} (y_j - \bar{y})^2 $$

The total sum of the squares can be partitioned into two
useful sums, the sum of the squares due to the fitted equation and
the residual sum of the squares, i.e.,

$$ \sum_{j=1}^{N} (y_j - \bar{y})^2 = \sum_{j=1}^{N} (Y_j - \bar{y})^2 + \sum_{j=1}^{N} (y_j - Y_j)^2 $$

SUM  OF  SQUARES  DUE  TO  THE  FITTED  EQUATION 

The sum of the squares due to the fitted equation,

$$ \text{SSFE} = \sum_{j=1}^{N} (Y_j - \bar{y})^2 , $$

is that part of the total variation in the dependent variable y that
can be attributed to the fitted equation. A large SSFE indicates
that the equation used is accounting for most of the variation.

RESIDUAL  SUM  OF  SQUARES 

The residual sum of the squares is that part of the total variation
that cannot be attributed to the fitted equation (such things as
experimental error, chance, function bias, i.e., having the wrong
form of the equation, or other biases), and is defined as

$$ \text{RESIDUAL SUM OF SQUARES} = \sum_{j=1}^{N} (y_j - Y_j)^2 $$
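
The partition of the total sum of squares can be checked numerically.
The sketch below fits a straight line to five hypothetical
observations and computes the three sums; the variable names are
illustrative only.

```python
# Hypothetical sample of N = 5 observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y))
      / sum((xj - x_bar) ** 2 for xj in x))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xj for xj in x]

total_ss = sum((yj - y_bar) ** 2 for yj in y)
ssfe = sum((Yj - y_bar) ** 2 for Yj in fitted)
residual_ss = sum((yj - Yj) ** 2 for yj, Yj in zip(y, fitted))
```

For these values the total of 38.9 splits into about 38.809 due to
the fitted equation and about 0.091 of residual. The partition holds
only when the coefficients are the least-squares estimates.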

RESIDUAL  MEAN  SQUARE 

Recall that Assumption A4 states that the y-observations have the same
constant though unknown variance $\sigma^2(y)$; hence, we need a means
of estimating this variance of prediction. One estimate of
$\sigma^2(y)$, called the residual mean square (or variance) and
denoted by $S^2(y)$, is defined as

$$ \text{RESIDUAL MEAN SQUARE} = \frac{\sum_{j=1}^{N} (y_j - Y_j)^2}{N - k - 1} $$

where N - k - 1 is the RESIDUAL DEGREES OF FREEDOM. The degrees of
freedom of a statistic is the number of independent observations minus
the number of parameters estimated in the calculation of the statistic.

RESIDUAL ROOT MEAN SQUARE

The square root of the residual mean square is called the RESIDUAL
ROOT MEAN SQUARE (or standard deviation). This can be interpreted in
much the same way as a standard deviation of the prediction equation.
The residual root mean square is also called the Standard Error of
Estimate.
STANDARD ERROR OF THE COEFFICIENT


Since each of the coefficients obtained by the method of least-squares
is only an estimate, the accuracy and importance of each estimate must
be shown. From the equation which calculates $b_i$, it can be shown that
the variance of $b_i$, denoted by $S^2(b_i)$, can be written in the form

$$ S^2(b_i) = S^2(y)\, c_{ii} $$

where $c_{ii}$ is the ith diagonal element of the inverse matrix. The
square root of the variance of $b_i$, $S(b_i)$, is called the standard
error of the coefficient and is calculated by the formula

$$ \text{S.E.COEF.} = S(y)\,(c_{ii})^{1/2} . $$

The standard error of the coefficient gives an indication of the
"inherent" precision of the coefficient estimated. If a coefficient
is large, say 1,000 with a standard error of .1, we could say that
the coefficient is estimated with great precision. However, a
coefficient of say .2 with a standard error of .1 is obviously not as
precise an estimate. Therefore, a means of determining the relative
accuracy of each coefficient is needed.

T-VALUE

The statistic used to measure the relative accuracy of each of the
coefficients is called the T-VALUE (denoted by $t_i$) and is defined by

$$ \text{T-VALUE} = \frac{b_i}{S(b_i)} = \frac{\text{COEF B(I)}}{\text{S.E.COEF.}} $$

The LLSCFP lists a T-VALUE for each coefficient. The larger a
coefficient's $t_i$-value (as compared with the other t-values), the
greater the chance that variable $x_i$ will be in the final fitted
equation.
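
The chain from residual mean square to standard error to T-VALUE can
be sketched for the simplest case of one independent variable (k = 1).
In that case the inverse matrix has the single element
1 / sum of (x_j - mean)^2, which is an assumption consistent with the
one-variable estimates given earlier. The data and names are
hypothetical.

```python
from math import sqrt

# Hypothetical sample of N = 5 observations, k = 1 independent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
n, k = len(x), 1

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xj - x_bar) ** 2 for xj in x)
b1 = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residual_ss = sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(x, y))
residual_mean_square = residual_ss / (n - k - 1)     # S^2(y)
residual_root_mean_square = sqrt(residual_mean_square)

c11 = 1.0 / sxx                                      # 1 x 1 inverse matrix
se_b1 = residual_root_mean_square * sqrt(c11)        # S(b1)
t_value = b1 / se_b1
```

With these values the slope of about 1.97 has a standard error of
about 0.055, so its T-VALUE is roughly 36.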

RELATIVE  INFLUENCE  OF  EACH  INDEPENDENT  VARIABLE 

A statistic called the relative influence of $x_i$ describes the fraction
of the total change in Y that can be accounted for by the accompanying
total change in the ith independent variable and is defined as

$$ \text{REL.INF.X(I)} = \frac{b_i w_i}{w_y} $$

where $b_i$ = COEF B(I) is the coefficient, $w_i$ = RANGE X(I) is the
range of the independent variable and $w_y$ is the range of the
dependent variable.
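
As a brief sketch of this statistic, the function below multiplies a
coefficient by the range of its variable and divides by the range of
the dependent variable. The function name, the coefficient and the
data are hypothetical.

```python
def relative_influence(b_i, x_values, y_values):
    """b_i times RANGE X(I), divided by the range of the dependent variable."""
    w_i = max(x_values) - min(x_values)
    w_y = max(y_values) - min(y_values)
    return b_i * w_i / w_y

# Hypothetical coefficient and observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
rel_inf = relative_influence(1.97, x, y)
```

Here the single variable accounts for nearly all of the range of y,
so its relative influence is close to 1.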


MULTIPLE  CORRELATION  COEFFICIENT  SQUARED 

The total sum of squares, the sum of squares due to the fitted
equation (SSFE), and the residual sum of squares will have different
values for different equations. The most widely used statistic which gives
a relative measure of the "goodness of fit" of the assumed equation
is the multiple correlation coefficient squared (denoted by $R_y^2$), where

$$\text{MULT. CORREL. COEF. SQUARED} = R_y^2 = \frac{\text{SSFE}}{\text{TOTAL SUM OF SQUARES}}$$

The multiple correlation coefficient squared (also called the coefficient
of determination) is defined as that fraction of the total sum of
squares that can be attributed to the fitted equation. If $R_y^2 = 1$ we
are fortunate to have a "perfect" fit, and if $R_y^2 = 0$ the fitted equation
does not fit the data at all. In most cases the multiple correlation
coefficient squared will fall between the values of 0 and 1, and
here interpretation is necessary.

For instance, if a straight line is fitted to a set of paired data
when the correct form of the equation is exponential (i.e., calls for
a logged dependent variable), a low $R_y^2$ does not indicate that there
is "no" relationship between the data, but that the relationship
used does not adequately represent the data. Thus the multiple
correlation coefficient squared measures the degree of the relationship
relative to the equation used. Sometimes, however, a large $R_y^2$ such
as .90 can occur when the wrong form of the equation is chosen (examples
will be given). Therefore a specific value of $R_y^2$ that indicates a "good"
fit is not given. Although the multiple correlation coefficient
squared is the most widely used statistic that measures the accuracy
of the relationships, it means very little to this analyst when
considered alone. It should be considered in conjunction with all
other statistics, graphical representations of goodness of fit, and
techniques used to determine the stability of the equations.
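
The point about a straight line fitted to exponential data can be checked numerically. The sketch below is a minimal one-variable least-squares fit in plain Python; the data (y = exp(x) on x = 0, ..., 8) and the helper names `fit_line` and `r_squared` are assumptions made for illustration and are not taken from the report.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a single independent variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """SSFE / TOTAL SUM OF SQUARES for the straight-line fit of y on x."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ssfe = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    total = sum((yi - y_bar) ** 2 for yi in y)
    return ssfe / total


# Hypothetical data that are exactly exponential in x.
x = list(range(9))
y = [math.exp(xi) for xi in x]

r2_raw = r_squared(x, y)                            # straight line fitted to y
r2_log = r_squared(x, [math.log(yi) for yi in y])   # straight line fitted to ln(y)
print(f"R2 with y: {r2_raw:.3f}   R2 with ln(y): {r2_log:.3f}")
```

The relationship between x and y is exact in both cases; only the assumed form of the equation changes, and with it the value of the statistic.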
F-VALUE


A statistic used to determine the "significance" of $R_y^2$ is the statistic
called the F-VALUE. The F-VALUE is usually used to judge the
equivalence of two independent estimates of variance. It can easily
be shown that

$$R_y^2 = \frac{\sum_{j=1}^{N} (Y_j - \bar{y})^2}{\sum_{j=1}^{N} (y_j - \bar{y})^2}
       = 1 - \frac{\sum_{j=1}^{N} (y_j - Y_j)^2}{\sum_{j=1}^{N} (y_j - \bar{y})^2}$$


and therefore indicates that the two estimates of variance

$$\frac{\sum_{j=1}^{N} (Y_j - \bar{y})^2}{k}
\qquad \text{and} \qquad
\frac{\sum_{j=1}^{N} (y_j - Y_j)^2}{N - k - 1}$$

should be helpful in determining the significance of $R_y^2$, where k and
N - k - 1 are the respective degrees of freedom.


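Hence, the F-VALUE is defined by the variance ratio,

$$\text{F-VALUE} = \frac{\sum_{j=1}^{N} (Y_j - \bar{y})^2 \,/\, k}{\sum_{j=1}^{N} (y_j - Y_j)^2 \,/\, (N - k - 1)}
= \frac{N - k - 1}{k} \cdot \frac{\sum_{j=1}^{N} (Y_j - \bar{y})^2}{\sum_{j=1}^{N} (y_j - Y_j)^2}$$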


which  indicates  that  the  larger  SSFE  is  with  respect  to  RESIDUAL  SUM 
OF  SQUARES,  the  larger  the  F-VALUE.  An  equivalent  form  of  the  F-VALUE  is 


$$\text{F-VALUE} = \frac{R_y^2 \,/\, k}{(1 - R_y^2) \,/\, (N - k - 1)}$$

If $R_y^2$ is close to 1 then $1 - R_y^2$ is close to 0 and the F-VALUE is
large. Of course the question must be answered, "How large is large
enough?" This question can be answered by performing the
statistical hypothesis test called the F-Test.

In performing a statistical test, a level of significance is usually
assumed, denoted by the Greek letter $\alpha$. The level of significance
is the amount of risk that the analyst is willing to take in rejecting
a true hypothesis. The values of $\alpha$ which are usually assumed are
.05 (the test would be called "probably significant" and further
experimentation may be in order) and .01 (the test is called "highly
significant"). Throughout this report $\alpha$ = .01 is the assumed level
of significance.

The F-Test compares the F-VALUE with values from an F-Table (in most
standard statistics books) with k and N - k - 1 degrees of freedom,
to give a joint test of the hypothesis that

"all the coefficients of the fitted equation are 0"
(indicating a bad fit)

against the alternative that

"the equation as a whole produces a significant reduction in
the total sum of squares" (indicating a good fit).

Suppose for a given fit the F-VALUE = 10.8 where there are N = 15
observations and k = 3 independent variables ($\alpha$ = .01). From an
F-table, a value (based on k = 3 numerator degrees of freedom and
N - k - 1 = 11 denominator degrees of freedom) of 6.22 can be extracted.
Since 10.8 is greater than 6.22 the fit is considered significant. The
larger the F-VALUE is with respect to its associated tabular value,
the more significant the fit.
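
The worked example (F-VALUE = 10.8, N = 15, k = 3, tabular value 6.22) can be traced through the form of the F-VALUE written in terms of the multiple correlation coefficient squared. In the sketch below the function names are illustrative, the tabular value is simply copied from the text rather than computed, and the implied value of R squared is derived here by inverting the formula; it is not a figure quoted in the report.

```python
def f_value(r2: float, n_obs: int, k: int) -> float:
    """F-VALUE = (R^2 / k) / ((1 - R^2) / (N - k - 1))."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return (r2 / k) / ((1.0 - r2) / (n_obs - k - 1))


def r2_from_f(f: float, n_obs: int, k: int) -> float:
    """Invert the formula above: the R^2 implied by a given F-VALUE."""
    return k * f / (k * f + n_obs - k - 1)


N_OBS, K = 15, 3
F_OBSERVED = 10.8
F_TABLE_01 = 6.22   # tabular value for (3, 11) degrees of freedom, from the text

implied_r2 = r2_from_f(F_OBSERVED, N_OBS, K)
is_significant = f_value(implied_r2, N_OBS, K) > F_TABLE_01
print(f"implied R^2 = {implied_r2:.4f}, significant at .01: {is_significant}")
```

Under these numbers an R squared of roughly .75 is already "highly significant" with 15 observations, which illustrates why significance alone says little about how closely the equation fits.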

SIMPLE  "LINEAR"  CORRELATION  COEFFICIENT 

A statistic  (similar  to  the  multiple  correlation  coefficient  squared) 
which  gives  an  indication  of  how  any  two  variables  are  linearly 
related  is  the  simple  "linear"  correlation  coefficient 
$$r_{12} = \frac{\sum_{j=1}^{N} (x_{1j} - \bar{x}_1)(x_{2j} - \bar{x}_2)}
{\left[\sum_{j=1}^{N} (x_{1j} - \bar{x}_1)^2 \; \sum_{j=1}^{N} (x_{2j} - \bar{x}_2)^2\right]^{1/2}}$$

which is a measure of the linear interdependence between any two
variables. The simple linear correlation coefficient is a number
between -1 and +1, where $r_{12} = 1$ indicates a positive linear
correlation, $r_{12} = -1$ indicates a negative linear correlation, and
$r_{12} = 0$ indicates that there is no linear relationship between the two
variables. Again, note that $r_{12} = 0$ does not indicate that there
is "no" relationship between the variables, but only that no linear
correlation exists.
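
A short numerical check makes the last remark concrete. The sketch below computes the simple linear correlation coefficient for two hypothetical pairs of variables: one exactly linear, and one exactly quadratic on a range symmetric about zero. The data and the function name `correlation` are assumptions for illustration.

```python
import math


def correlation(x1, x2):
    """Simple linear correlation coefficient between two variables."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    return s12 / math.sqrt(s11 * s22)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_linear = correlation(x, [2.0 * v + 1.0 for v in x])   # exact straight line
r_squared_term = correlation(x, [v * v for v in x])     # exact parabola

print(f"linear relation   : r = {r_linear:.3f}")
print(f"quadratic relation: r = {r_squared_term:.3f}")
```

The second pair is perfectly related, yet its linear correlation is zero, because the relationship has no linear component over this range.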

R(I) SQRD

There are obviously many problems in which more than two independent
variables are assumed to be influential, and it is very informative
to see how all the independent variables are linearly related to each
other. A statistic (a generalization of $r_{12}$) that gives an
indication of how the ith independent variable is linearly related
to all the other (k - 1) independent variables is the squared
multiple "linear" correlation coefficient $R_i^2$. Here, with variable $x_i$
as the dependent variable and the remaining (k - 1) variables as the
independent variables, we fit a simple linear equation and calculate the
multiple correlation coefficient squared, which is $R_i^2$. Thus $R_i^2$
measures the degree of linear dependence of $x_i$ on the other $x_{i'}$,
$i' \neq i$, where $R_i^2 = 1$ indicates a strong (in fact exact) linear
relationship between $x_i$ and the other $x_{i'}$'s. If only two independent
variables are considered then

$$R_1^2 = r_{12}^2 = r_{21}^2 = R_2^2.$$

It can be shown that $C_{ii}$ (the element of the inverse matrix in the
ith row and ith column) may be written as

$$C_{ii} = \frac{1}{\sum_{j=1}^{N} (x_{ij} - \bar{x}_i)^2 \,(1 - R_i^2)}.$$
This leads to a relationship between the standard error of the
coefficient $s(b_i)$ (S.E.COEF.) and the squared multiple linear
correlation coefficient $R_i^2$ (R(I) SQUARED), namely

$$s(b_i) = \frac{s(y)}{\left[\sum_{j=1}^{N} (x_{ij} - \bar{x}_i)^2 \,(1 - R_i^2)\right]^{1/2}}.$$

With the form of the equation chosen, $s(y)$ and $\sum_{j=1}^{N} (x_{ij} - \bar{x}_i)^2$ are
constant and determined by the data, and $R_i^2$ therefore determines the
size of $s(b_i)$. If $R_i^2$ is close to 1, then $1 - R_i^2$ is close to 0, which
increases the size of the standard error of the coefficient, $s(b_i)$.

A large $s(b_i)$ will result in a small $t_i$-value, thus possibly dropping
$x_i$ from the set of influential variables to be used in the final
prediction equations. Some independent variables are simply uninfluential
in their effect on the dependent variable and will be dropped (by
using a technique called the Cp-search technique), while other
independent variables may be so highly correlated (linearly) with the
remaining independent variables that their effects on the dependent
variable can be explained by the remaining variables.
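
The effect of R(I) SQUARED on the standard error can be sketched numerically. The values of s(y) and of the sum of squared deviations below are hypothetical; only the functional form, s(y) divided by the square root of the sum of squares times (1 - R(I) SQUARED), comes from the relationship stated above.

```python
import math


def std_err_coef(s_y: float, sum_sq_dev: float, r2_i: float) -> float:
    """s(b_i) = s(y) / sqrt( sum_j (x_ij - xbar_i)^2 * (1 - R_i^2) )."""
    if not 0.0 <= r2_i < 1.0:
        raise ValueError("R_i^2 must lie in [0, 1)")
    return s_y / math.sqrt(sum_sq_dev * (1.0 - r2_i))


S_Y = 1.0            # hypothetical residual root mean square
SUM_SQ_DEV = 100.0   # hypothetical sum of squared deviations of x_i

for r2_i in (0.0, 0.5, 0.9, 0.99):
    se = std_err_coef(S_Y, SUM_SQ_DEV, r2_i)
    print(f"R_i^2 = {r2_i:4.2f}  ->  s(b_i) = {se:.3f}")

inflation = std_err_coef(S_Y, SUM_SQ_DEV, 0.99) / std_err_coef(S_Y, SUM_SQ_DEV, 0.0)
```

With the same data, moving R(I) SQUARED from 0 to .99 multiplies the standard error by about ten and so divides the t-value by the same factor.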


THE D-STATISTIC

Many times the statistics and plots will indicate some form of lack
of fit of the assumed equation. This lack of fit may be caused by
having the wrong form of the equation, sometimes called
function bias. For instance, we may be fitting a straight line to a
set of paired data when actually a parabola is the correct form,
i.e., a squared term is needed in the equation. If the statistics
and plots indicate that curvature in a particular variable $x_i$ is
needed, $x_i^2$ can be used as an additional independent variable. Usually
$(x_i - \bar{x}_i)$ and $(x_i - \bar{x}_i)^2$ are used as the independent variables to
reduce the high correlation between $x_i$ and $x_i^2$, which would otherwise
result in an unwanted large $R_i^2$. Sometimes, however, centering on the
mean $\bar{x}_i$ does not sufficiently reduce the high correlation between a
variable and its square. A statistic (called the d-statistic) due to
O. Dykstra [1], which requires the covariance between $(x_i - \bar{x}_i)$ and
$(x_i - d_i)^2$ to be zero, can be used to reduce such high correlations.
This reduces to solving

$$\sum_{j=1}^{N} (x_{ij} - \bar{x}_i)(x_{ij} - d_i)^2 = 0$$

for $d_i$, which yields

$$d_i = \frac{\sum_{j=1}^{N} x_{ij}^2 \,(x_{ij} - \bar{x}_i)}{2 \sum_{j=1}^{N} (x_{ij} - \bar{x}_i)^2}.$$

In this case we use $(x_i - \bar{x}_i)$ and $(x_i - d_i)^2$ as the independent
variables. If $\bar{x}_i$ is used instead of $d_i$, a large $R_i^2$ for either the
variable or its square may occur, thereby possibly dropping the variable
and/or its square as being uninfluential, when in fact both
may be influential variables. Throughout this report, the $d_i$-statistic
is used if there is an indication of curvature in the relationships.
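
The following sketch checks the d-statistic on a small skewed sample. The data are hypothetical and the helper names are illustrative; the formula for d is the one obtained by setting the covariance between the centered variable and the squared term to zero.

```python
import math


def d_statistic(x):
    """d = sum x^2 (x - xbar) / (2 * sum (x - xbar)^2)."""
    x_bar = sum(x) / len(x)
    numerator = sum(v * v * (v - x_bar) for v in x)
    denominator = 2.0 * sum((v - x_bar) ** 2 for v in x)
    return numerator / denominator


def correlation(u, v):
    """Simple linear correlation coefficient between two variables."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Hypothetical skewed variable: one observation far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
x_bar = sum(x) / len(x)
d = d_statistic(x)

centered = [v - x_bar for v in x]
r_with_mean = correlation(centered, [(v - x_bar) ** 2 for v in x])
r_with_d = correlation(centered, [(v - d) ** 2 for v in x])

print(f"mean = {x_bar}, d = {d}")
print(f"correlation using the mean: {r_with_mean:.3f}")
print(f"correlation using d       : {r_with_d:.3f}")
```

For a skewed variable such as this one, centering the squared term on the mean leaves a high correlation with the linear term, while centering on d removes it.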

THE  CP-STATISTIC,  P = k + 1 

In many cases when multiple variables are considered, not enough
previous work has been done in the area of study to be sure that all
the independent variables are influential; possibly a subset
of the variables fits the data "better" than, or as well as,
the initial set. If there are k = 18 independent variables, then
there are $2^{18}$ = 262,144 possible combinations of variables whose
equations must be compared.

A major innovative statistic due to C. Mallows ([1] and [3]), called the
Cp-statistic, is used as a measure of "goodness of fit" to compare all
the possible $2^k$ combinations of equations for the "set" of equations
which best fits the data. The Cp-statistic represents the "total squared
error (random squared error and bias squared error)" and is defined as

$$C_p = \frac{\text{RESIDUAL SUM OF SQUARES}}{\text{RESIDUAL MEAN SQUARE}} - N + 2p$$

A Cp-statistic is calculated for each equation. Note that for every
variable dropped, the Cp-statistic can decrease by at most 2 units.
For the derivation of $C_p$ see [1] and [3].

3  "Choosing Variables in a Linear Regression: A Graphical Aid,"
C. L. Mallows, presented at the Central Regional Meeting of the
Institute of Mathematical Statistics, Manhattan, Kansas, May 7-9, 1964.
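
A minimal numerical sketch of the Cp-statistic follows. It assumes, as in Mallows' usual formulation, that the residual mean square in the denominator is the one from the full equation containing all k variables; the sums of squares and the function name are hypothetical.

```python
def cp_statistic(rss_p: float, residual_mean_square: float, n_obs: int, p: int) -> float:
    """C_p = RSS_p / (residual mean square) - N + 2p, with p = (variables) + 1."""
    return rss_p / residual_mean_square - n_obs + 2 * p


# Hypothetical full equation: N = 20 observations, k = 4 variables, p = 5.
N_OBS = 20
P_FULL = 5
RSS_FULL = 30.0
S2 = RSS_FULL / (N_OBS - P_FULL)   # residual mean square of the full equation

cp_full = cp_statistic(RSS_FULL, S2, N_OBS, P_FULL)   # equals p for the full equation

# Dropping one variable (p = 4) can only raise the residual sum of squares.
cp_drop_useless = cp_statistic(30.0, S2, N_OBS, 4)    # RSS unchanged: C_p falls by 2
cp_drop_useful = cp_statistic(40.0, S2, N_OBS, 4)     # RSS rises: C_p well above p

print(cp_full, cp_drop_useless, cp_drop_useful)
```

Because the residual sum of squares cannot decrease when a variable is removed, the only reduction available is the 2 units contributed by the 2p term, which is the "at most 2 units" noted above.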

PRECISION  OF  X(I) 

This statistic is a new statistic not yet in the statistical canon.
A paper to be published in the near future by F. S. Wood will
introduce it; hence it will not be discussed here.

THE  "INTERIOR"  or  "LOCAL"  STATISTICS 

All the statistics discussed previously fall under the heading of
"global" statistics in that they are statistics of the entire set of
data. The "global" statistics are helpful in determining how the
independent variables influence the fitted equations, but they do not
describe how the observations (the interior of the data) in multifactor
space affect the fit. Four innovative "interior" statistics which have been
developed (see [1]) can assist in such things as detecting outliers;
indicating observations which may influence the form of the
equation (possibly introducing curvature); detecting those
observations which have the largest effect on the assumed equation;
finding those observations which are taken approximately under the
same $x_i$-conditions (called nearby neighbors); and testing the
validity of the "global" statistics. These nearby neighbors are used
to obtain a less biased estimate of our error of prediction, $\sigma^2(y)$.

The interior statistics defined are weighted by the $b_i$-values so
as to reduce the effects of uninfluential factors.

WSS  DISTANCE  (Weighted  Squared  Standardized  Distance) 

An observation whose values lie at the extreme ends of the ranges of
the independent variables is usually far from the "centroid" of all the
observations of the data. For instance, a piece of equipment may have a
much larger (or smaller) weight than all the rest of the data. A statistic
called the Weighted Squared Standardized Distance (WSS DISTANCE),
defined for observation j by

$$\text{WSS DISTANCE}_j = \sum_{i=1}^{k} \left[\frac{b_i\,(x_{ij} - \bar{x}_i)}{s(y)}\right]^2,$$

is useful in detecting those observations which are far from the
centroid of all the observations. These observations (with large
WSS DISTANCE) may indicate that outliers are present or possibly that
curvature is needed in the equation.

COMPONENT EFFECTS, $C_{ij}$

The statistic $C_{ij}$, where

$$C_{ij} = b_i\,(x_{ij} - \bar{x}_i),$$

may be defined as the component effect of $x_i$ (the ith independent
variable) on $Y_j$ (the fitted value of observation j). A tabular
arrangement (see "COMPONENT EFFECTS" TABLE) of the component effects
assists the analyst in determining the influence that each particular
observation has on the fitted equation. This is helpful in the
analysis of trade studies.
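
The component effects and the WSS DISTANCE built from them can be sketched as follows. The coefficients, means, residual root mean square and observations are hypothetical; the functions simply apply the two definitions given in this section.

```python
def component_effects(b, x_obs, x_means):
    """C_ij = b_i * (x_ij - xbar_i) for one observation j, one entry per variable i."""
    return [b_i * (x_ij - m_i) for b_i, x_ij, m_i in zip(b, x_obs, x_means)]


def wss_distance(b, x_obs, x_means, s_y):
    """Sum over variables of (C_ij / s(y))^2 for one observation j."""
    return sum((c / s_y) ** 2 for c in component_effects(b, x_obs, x_means))


# Hypothetical fitted equation with two independent variables.
B = [2.0, 0.5]           # coefficients b_1, b_2
X_MEANS = [10.0, 4.0]    # means of x_1, x_2 over all observations
S_Y = 2.0                # residual root mean square

typical = [11.0, 6.0]    # observation near the centroid
extreme = [20.0, 4.0]    # observation far out on x_1

print(component_effects(B, typical, X_MEANS), wss_distance(B, typical, X_MEANS, S_Y))
print(component_effects(B, extreme, X_MEANS), wss_distance(B, extreme, X_MEANS, S_Y))
```

The extreme observation stands out immediately through its much larger WSS DISTANCE, and its component effects show which variable is responsible.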


WSSD 


In some cases the observations may have nearly the same $x_i$-values
(for instance similar pieces of equipment) and can be considered
as being "close" to each other in multidimensional factor space
(nearby neighbors). The statistic

$$\text{WSSD}_{jj'} = \sum_{i=1}^{k} \left[\frac{b_i\,(x_{ij} - x_{ij'})}{s(y)}\right]^2
= \frac{1}{s^2(y)} \sum_{i=1}^{k} (C_{ij} - C_{ij'})^2$$

measures the (squared) distance in "effect space" between two
observations j and j'.
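
A brief sketch shows how the weighting by the coefficients works in WSSD. All numbers are hypothetical; the second variable is given a zero coefficient to represent an uninfluential factor.

```python
def wssd(b, x_j, x_j_prime, s_y):
    """Squared distance in "effect space" between observations j and j'."""
    return sum(
        (b_i * (a - c) / s_y) ** 2
        for b_i, a, c in zip(b, x_j, x_j_prime)
    )


B = [2.0, 0.0]   # hypothetical coefficients; x_2 is uninfluential
S_Y = 1.0        # hypothetical residual root mean square

obs_a = [10.0, 1.0]
obs_b = [10.5, 50.0]   # very different in x_2, close in x_1
obs_c = [20.0, 1.0]    # identical in x_2, far away in x_1

wssd_ab = wssd(B, obs_a, obs_b, S_Y)
wssd_ac = wssd(B, obs_a, obs_c, S_Y)
print(wssd_ab, wssd_ac)
```

Observations a and b count as near neighbors even though they differ greatly in the uninfluential variable, whereas a and c do not, because the difference in the influential variable is what moves the fitted value.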


This statistic is also helpful in the detection of what is known as
"Nested Data." If the analyst determines that the data are nested,
additional things must be considered to find the correct form of the
equations (see [1]).




CUMULATIVE  STANDARD  DEVIATION  ESTIMATED  FROM  NEAR  NEIGHBORS 


Recall that as a "global" estimate of random error of prediction,
the residual mean square (variance) or the residual root mean square
(standard deviation) was used. Sometimes the inner characteristics
of the data may indicate that these are not very good estimates. The
statistic WSSD identifies those near neighbors which are used to
obtain a less biased running estimate, $S_n$, of $s(y)$, called the
standard deviation estimated from near neighbors, where

$$S_n = \frac{.886 \left(\sum \Delta d_n\right)}{n}$$

Here $\Delta d_n$ is the absolute value of the difference of the residuals
of the neighboring observations, and the value .886 = 1/1.128 is used
since the expected value of the range for pairs of independent
observations from a normal distribution is 1.128 (in units of the
standard deviation). If the residual root mean square is close to the
successive estimates of the cumulative standard deviation $S_n$, there is
then no evidence of lack of fit of the proposed equation.
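
The running estimate can be sketched as a cumulative average. The sketch assumes that n counts the neighbor pairs accumulated so far and that the pairs have already been ordered from nearest to farthest by WSSD; the residual differences are hypothetical.

```python
def cumulative_sd(abs_residual_differences):
    """Running estimates S_n = .886 * (sum of the first n |differences|) / n."""
    estimates = []
    running_sum = 0.0
    for n, diff in enumerate(abs_residual_differences, start=1):
        running_sum += abs(diff)
        estimates.append(0.886 * running_sum / n)
    return estimates


# Hypothetical absolute differences of residuals for successive near-neighbor pairs.
differences = [0.02, 0.04, 0.03, 0.05]
running = cumulative_sd(differences)

for n, s_n in enumerate(running, start=1):
    print(f"n = {n}: S_n = {s_n:.4f}")
```

These successive values would then be compared with the residual root mean square; rough agreement is read as no evidence of lack of fit.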

STATISTICAL  PRINTOUTS  AND  TABLES 

The statistics are printed out in an orderly manner which the analyst
can use to further evaluate the fit of an equation. A partial
printout of the statistics on the equation obtained for the fit of the
dependent variable MAINTENANCE MANHOURS PER OPERATING HOUR, where
$Y_1$ = MMH/OH, is shown in Figures 1 to 4. Figure 1 shows many of
the global statistics, including the coefficients $b_i$, $s(b_i)$, $t_i$ and
the relative influence of $x_i$ for each independent variable $x_i$. In
addition $R_y^2$, the F-VALUE and the residual root mean square are
displayed.

OBSERVATIONS  ORDERED  BY  COMPUTER  INPUT  AND  BY  RESIDUALS 

As shown in Figure 2, under the heading "ORDERED BY COMPUTER INPUT,"
the residuals are listed in the order in which the observations
were given to the computer. The Work Unit Code (WUC) of each piece of
equipment is given in the first column for identification purposes.
The third column shows the WSS DISTANCE for each observation. Here
those observations far from the centroid of all observations can be
easily spotted (observations 47 and 22). Under the heading "ORDERED BY
RESIDUALS," the residuals are listed in order of magnitude.
This gives an indication of which observations
are fitted the best (or worst). As can be seen, observations 46 and 13
are fitted best and 61 is fitted worst.

STANDARD  DEVIATION  ESTIMATED  FROM  RESIDUALS  OF  NEIGHBORING  OBSERVATIONS 

The cumulative estimates, $S_n$, of the standard deviation are printed
in the second column of Figure 3. The WSSD$_{jj'}$ of the observations
in columns 4 and 5 are printed in the third column. The
observations are ordered by their increasing fitted y values. At the top of
Figure 3 is the residual root mean square = .03. The cumulative
standard deviation column indicates that the standard deviation
estimated from near neighbors is approximately .03; hence there is
no evidence of lack of fit.

"COMPONENT  EFFECTS"  TABLE 

The component effect, $C_{ij}$, of each variable on each observation is
printed in tabular form (Figure 4), where the variables are ordered by
their decreasing relative influences in columns, and the observations
are ordered by their decreasing effects on the most influential
variables in rows. Here the analyst can see which particular
observations are most influential in their effects on the fitted equation.
In addition this table can be used to determine the importance of
high correlation among the independent variables.

STATISTICAL  PLOTS 

As with any endeavor dealing with deductive reasoning, the conclusions
are dependent on the validity of the assumptions. Thus the analyst
must have some means of verifying the degree to which the
assumptions are satisfied. In addition to the statistics and
statistical tables, there are five types of computerized plots that
can be used to determine how closely the data and fitted equations
satisfy the assumptions. These plots give the analyst much insight
into the fit that the statistics alone cannot. The plots are used
(1) to determine whether the assumptions of the method of least squares
are "nearly" satisfied, (2) to determine just how well (or badly) the
equation fits the data, and (3) to obtain further insights into the
distributional properties of the data and how these properties affect
the fit.

CUMULATIVE  DISTRIBUTION  OF  RESIDUALS 

When k independent variables are fitted to data with normally
distributed error (Assumption A3), it can be shown that the
residuals also have a Normal distribution. Therefore, the graph of
the residuals versus cumulative frequency should be "nearly" a straight
line. This plot is helpful in determining whether the data satisfy
Assumptions A1, A2 and A4. Figure 5 is a plot for the initial fit of
ln(MTBF). Obviously there is an observation whose residual is separated
from the rest of the data. This observation may be an outlier
(violating Assumption A1) or may indicate that some form of curvature
is needed in the equation (Assumption A2). After investigation it was
determined that the point was indeed an outlier. Figure 6 is the
cumulative frequency plot for the fitted equation obtained, with
ln(MTBF) as the dependent variable. There is no indication here of
deficiencies in the fit.

RESIDUALS  VS  FITTED  Y 

The plot of the residuals versus the fitted values of the dependent
variable is also helpful in checking Assumptions A1, A2 and A4. This
plot may show whether there is some dependence of the magnitude
of the residuals on the magnitude of the fitted values. Daniel
and Wood [1] give four common defects that may be revealed by
such plots. Recall Assumption A3 states that the variance of the
error is constant. The plot of residual versus fitted Y should then show
an equal scatter about the 0-residual line. Figure 7 is a plot for the
initial fit of ln(MTBF). As in the cumulative frequency plot (Figure 5),
one observation is separated from the remainder of the data. Both the
cumulative frequency plot and the plot of residual vs. fitted y are
necessary to determine whether a point is an outlier. Again, if
this observation is at the extreme ends of the ranges of the dependent
variable, curvature may be the solution. Figure 8 is a plot of
residuals versus fitted Y for the equation obtained for ln(MTBF). The
equal scatter of the residuals does not indicate deficiencies in the
equation obtained (as was the case in Figure 6).

RESIDUALS  VS  INDEPENDENT  VARIABLE  X(I) 

The pattern of the residuals in the plot of residuals versus independent
variable is useful in determining whether other functional
forms of the independent variables are needed. The residuals should
be equally scattered about the 0-residual line. As an actual example,
Figure 9 is a plot of the residuals versus an independent variable x
where obviously a squared term is needed in the equation. This plot
was obtained when a fit

$$y = b_0 + b_1 x_1 + b_2 x_2$$

was  made  to  a set  of  data  when  the  true  form  of  the  equation  was 
y = bQ  + bjXj  + b 2 x 2 + bjxJ 

For this fit, however, the "global" statistics were significant and did
not indicate anything wrong with the fitted equation. In particular
$R_y^2$ = .9047 and the F-VALUE = 228. Figure 10 is another example plot
where the equation

y = b g + bjXj  + b2x|  + b3x^ 
was  fitted  to  data  and  the  true  form  was 
y = bg  + b j x j + b2x2  + b^ 

Here, $R_y^2$ = .9963 and the F-VALUE = 3267. These two simple examples
indicate why the practice of considering only $R_y^2$ and the
F-VALUE as measures of "goodness of fit" is not statistically sound.

COMPONENT  AND  COMPONENT-PLUS-RESIDUALS  VS  INDEPENDENT  VARIABLE  X(I) 

The component versus independent variable $x_i$ plot is a plot of the
component effect $C_{ij}$ of each observation on each variable versus the
independent variable. The component-plus-residual is the sum of the
component effect of each observation and its residual. As stated in [1],

"Component-plus residual plots are used as an aid

(1) to choose the appropriate form of the equation,

(2) to observe the distribution of the observations
over the range of each independent variable and

(3) to estimate the influence of each observation
on each component of the equation."

Observations at the extreme ends of the ranges of the independent
variables usually control the estimates of the statistics. The
component-plus-residuals plots can be used (with indicator variables
and the Cp-search technique) to determine if these extreme points are
compatible with the remainder of the data. If it is determined that
these extreme values are not compatible with the rest of the data,
then either curvature should be introduced in that independent
variable or other subjective information (introduced by indicator
variables) about the points in question should be considered.

Figure 11 is a plot for the training cost equation obtained, where the
independent variable is the % power supply. As can be seen from the
graph, only one observation extends the range of % power supply by
over 3,000%. It was later found that introducing curvature in the
% power supply had a significant impact on the fit. The residuals should
be equally scattered about the component line.

CP  VS  P 

The Cp-plot (developed by Mallows [3]) is a plot of the Cp-statistic for
an equation versus p, where p = k + 1. For those equations with negligible
bias, the Cp-statistic will fall near the line Cp = p. Obviously the analyst
would like to choose the equations with the smallest total squared errors
(Cp) and with the least amount of bias. Figure 12 is an example of a
Cp-plot for the NRTS equation obtained, where it can be seen that the
Cp-statistic for equation 1 is on the line Cp = p, which indicates that
there is no evidence of function bias. The Cp-statistic for equation 2 is
about the same as for equation 1 but is above the line Cp = p, which
indicates that more bias is present in equation 2 than in equation 1.

STATISTICAL TECHNIQUES

There are two techniques utilized that are helpful in finding the
subset of variables which best fits the data and in
determining the stability of the equations obtained.
CP-SEARCH TECHNIQUE


Two approaches that have been widely used to search the $2^k$ possible
equations for the "best" combination of variables are called
"Stepwise Regression (forward and backward)" and the "F-test."

Forward stepwise regression introduces the independent variables,
one at a time, into the equation and uses a criterion involving the
$t_i$-values to determine whether or not $x_i$ should be left in the equation.
Backward stepwise regression begins with the complete initial set
of variables and drops variables by using a criterion similar to
that of forward stepwise regression. The F-test is widely used to
determine the significance of adding an independent variable.

Obviously, these two techniques do not search all the $2^k$ possible
equations, but only a portion of them. Moreover, a search with
these techniques can lead to different results when the independent
variables are correlated or if the variables are introduced in
different orders.


With the equation form assumed, there is usually some smaller subset of
these variables which has "very" influential effects on the dependent
variable. This subset can be called the Basic Set of Variables. A search
proposed by Daniel and Wood [1], called the $t_i$-directed search, is used to
determine a Basic Set. With the Basic Set always included, a technique
called the Cp-search technique is used to search up to $2^{18}$ = 262,144
equations for the combination of variables which gives the smallest
Cp-statistic, and hence the smallest total squared error. In some
cases (e.g., when the wrong form of the equation is used) there is no
Basic Set of Variables, and here the analyst has the option of choosing
a basic set (usually those variables with the largest $t_i$-values)
until there are at most 18 variables remaining to be searched by the
Cp-search technique. This method is called Fractional Replication
(see [1]). The Cp-search technique using both types of searches
has consistently helped to narrow down the initial set.

CROSS  VERIFICATION  OF  COEFFICIENTS 

Once a presumably final equation is obtained, the analyst must
determine the stability of the obtained equation's coefficients. There
may be a few observations in the data base (such as those with
large WSS DISTANCE and large residuals) that are not compatible
with the rest of the data and may be controlling the estimates of
the fitted coefficients. A way to determine stability is to drop
those observations, run another regression and determine the effects
on the least-squares estimates of the coefficients. This
technique is called cross verification of coefficients with a second
sample of data and provides a rigorous test of the data, the
model and the fitted coefficients.

Component-plus-residual plots of the second sample of data (where
residuals are calculated using the initial coefficients) may
point out those observations which may indicate that other forms
of curvature are needed.
SECTION VI


AN  EXAMPLE, 

MEAN  TIME  BETWEEN  MAINTENANCE  ACTIONS  (MTBMA) 

The data (a year's worth) were chosen from existing data systems to
determine the CERs and PERs used to estimate O&M cost and are shown
in Appendix A. The study is based on 64 pieces of avionics equipment
(observations), called LRUs. Each observation can be identified by
its observation number and Work Unit Code (WUC). Associated with each
observation is a total of 27 variables, of which 21 are independent
variables and 6 are dependent variables. The independent variables
are of two types, quantitative and qualitative. The usual variables
in a regression exercise are quantitative (i.e., variables that may
take on values over a given range), such as weight or other physical
characteristics of the equipment. Often additional (qualitative)
information is available, such as certain characteristics of the
equipment or a class to which the equipment belongs. This information
should not be discarded but should be introduced into the regression,
since more information should lead to a better fit. "Indicator"
variables (variables which take on the value 0 or 1) are used to
introduce qualitative information into the regressions. A "1"
indicates that the observation is in a certain class and a "0"
indicates that it is not.
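
The 0/1 coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode` and the level labels are assumptions made for the example, and the choice of fighter and navigation as the levels that receive no indicator follows the baseline convention used in this study.

```python
# Minimal sketch of 0/1 indicator coding with one baseline level per class.
# The labels and the function name are illustrative assumptions.
AIRCRAFT_INDICATORS = ("BOMBER", "CARGO")  # fighter has no indicator (baseline)
AREA_INDICATORS = ("SENS", "COMM")         # navigation has no indicator (baseline)


def encode(aircraft: str, area: str) -> dict:
    """Return the four 0/1 indicator values for one piece of equipment."""
    row = {name: 0 for name in AIRCRAFT_INDICATORS + AREA_INDICATORS}
    if aircraft in AIRCRAFT_INDICATORS:
        row[aircraft] = 1
    if area in AREA_INDICATORS:
        row[area] = 1
    return row


nav_fighter = encode("FIGHTER", "NAV")   # baseline in both classes
comm_cargo = encode("CARGO", "COMM")
print(nav_fighter)
print(comm_cargo)
```

A piece of navigation equipment used in a fighter is therefore represented by zeros in all four indicator columns.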

The type of aircraft in which a piece of equipment is used and the
equipment avionics area are the two qualitative classes used in
this study. There are three types of aircraft (Fighters, Bombers
and Cargo) and three areas of avionics (Navigation, Sensory and
Communications). Table 1 shows the observation numbers of each piece
of equipment and the class to which it belongs. For instance,
observation 9 is a piece of navigation equipment that is used in a
fighter. Since there is no sensory equipment used in cargo type
aircraft, no observations are present there. The numbers in
parentheses indicate the quantity of observations in each category.
Thus 18 LRUs are used in bombers and 16 LRUs are sensory type
equipment. The numbers in the corners of the inner rectangles
indicate the number of observations which fall in the respective
interactive classes. For example, 4 observations are communications
equipment used in fighters. Table 2 lists all the variables and the
associated variable names used in the computer printouts of the
regressions. Also listed are the units in which the variables are
expressed. The quantitative independent variables and the dependent
variables are defined in Volume I of this report. The indicator
variables (qualitative independent variables) need some further
clarification.

BOMBER and CARGO are indicator variables used to represent equipment
that is used in bomber and cargo type aircraft respectively. SENS
and COMM are indicator variables indicating that the avionics areas
of the equipment are sensory and communications respectively. Note
that there is no indicator variable for fighter aircraft or
navigation equipment. Using indicator variables in this fashion,
that is, having a "baseline" for each class or category, can be very
informative. Without loss of generality, fighters are chosen as the
"baseline" for the aircraft types and navigation equipment is chosen
as the "baseline" for the avionics area. We can then find those
members of a certain class that are significantly different from the
baseline. As an example, if the indicator variable BOMBER is
significant enough to be in the final equation for MTBMA, this may
be interpreted to mean that the MTBMA for equipment used in bombers
is statistically significantly different from the MTBMA for equipment
used in fighters (the baseline). Conversely, if COMM does not remain,
this indicates that the MTBMA for communications equipment behaves
in a similar
TABLE 1

Qualitative Categories

Aircraft type:    FIGHTERS (29)      BOMBERS (18)    CARGO (17)
Avionics area:    NAVIGATIONS (35)   SENSORY (16)    COMMUNICATIONS (13)


TABLE 2.

Variables Used in the Regressions

Independent Variables

"Indicator"

X1M = (BOMBER - $\overline{\mathrm{BOMBER}}$)
X2M = (CARGO - $\overline{\mathrm{CARGO}}$)
X3M = (SENS - $\overline{\mathrm{SENS}}$)
X4M = (COMM - $\overline{\mathrm{COMM}}$)
X5  = X1 x X3
X6  = X1 x X4
X7  = X2 x X4

Quantitative

X8  = Unit price
X9  = Volume
X10 = Weight
X11 = Components count
X12 = Components density
X13 = % Digital
X14 = % Analog
X15 = % Electro-mechanical
X16 = % Power supply
X17 = % Transmitter
X18 = % Solid state
X19 = Power dissipation
X20 = Utilization factor
X21 = % BIT/FIT

Dependent Variables

Y1 = Maintenance Manhours per Operating Hour (MMH/OH)
Y2 = Mean Time Between Failure (MTBF)
Y3 = Mean Time Between Maintenance Actions (MTBMA)
Y4 = Logistics Support Cost per Operating Hour (LSC/OH)
Y5 = Training Cost per Operating Hour (TRAIN/OH)
Y6 = Not Repairable This Station (NRTS)
manner as that of navigation equipment (the baseline). There also
exists the possibility that communications equipment used in cargo
type aircraft might have a significantly different effect on the
dependent variable from that of navigation equipment used in
fighters.

Three indicator variables (which are products of the four initial
indicator variables) that can be used to determine the effects of
such interactions are: BOMBER x SENS, BOMBER x COMM and CARGO x
COMM. In the regressions, however, the 7 indicator variables used
are the ones shown in Table 2, where the bar above a variable
indicates its mean. This is done in order to reduce the sometimes
high correlation between the indicator variables and their products.
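
The effect of subtracting the mean can be illustrated with a small numerical sketch. The ten observations below are invented, `pearson` is a helper written for this example, and it is assumed here that the product term is formed from the mean-centered indicators; none of this is taken from the study's data or programs.

```python
# Sketch: centering two indicators before multiplying them lowers the
# correlation between an indicator and the product term. Data are invented.
from math import sqrt


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


bomber = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
sens = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

mean_b = sum(bomber) / len(bomber)
mean_s = sum(sens) / len(sens)
x1m = [b - mean_b for b in bomber]          # BOMBER minus its mean
x3m = [s - mean_s for s in sens]            # SENS minus its mean

raw_product = [b * s for b, s in zip(bomber, sens)]
centered_product = [b * s for b, s in zip(x1m, x3m)]

r_raw = pearson(bomber, raw_product)
r_centered = pearson(x1m, centered_product)
print(round(r_raw, 3), round(r_centered, 3))
```

For these invented values the correlation between the indicator and its product drops markedly once both factors are centered, which is the motivation given in the text.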

If any of the three interactive variables X5, X6 and X7 are
significant, then the analysis of the MTBMA example (above paragraph)
will have quite a different interpretation. If interactions prove
to be significant (as was the case in all regression equations
obtained in this study), the comparison of the levels of the two
classes (aircraft type, avionics area) with their baselines should
not be used. Instead, the analyst should find a means of interpreting
which specific interactions are significantly different from which
others.

Returning to Appendix A, we see three lines of information associated
with each observation. The first line lists the 7 indicator variables,
the second the (quantitative) independent variables and the third the
6 dependent variables. Thus observation 2 is a piece of navigation
equipment used in a fighter with weight = 36.60, % solid state = 73.
and MTBMA = 274.

Initially 85 LRUs were considered for the study. Many of the
observations were dropped from the analysis because of the lack of
data or the difficulty in obtaining the necessary data. Other
observations, such as equipment which had not been in the Air Force
inventory long enough to accumulate "good" data, were discarded so
as not to introduce bias in the results. In addition, a couple of the
observations had missing dependent variable data and were omitted
for that particular fit.

Once all the data have been collected, a panel of qualified experts
on the studied equipment should determine the validity of each data
element in the data base. This was done to a limited extent in this
study (because of time constraints).

Many  times  the  statistics,  plots,  tables  and  techniques  indicate 
that  some  observations  do  not  behave  like  the  remainder  of  the  data. 
Besides  other  possible  subjective  variables,  curvature  may  be  causing 
this  instability.  In  addition  to  the  variables  shown  in  Table  3, 
two  transformations  of  the  independent  variable  (the  square  and 
natural  logarithm)  and  a transformation  of  the  dependent  variable 
(natural  logarithm)  are  introduced  into  the  regressions  when 
curvature  is  indicated.  The  natural  logarithm  transformation  is  con- 
sidered for  those  variables  whose  range  is  contained  in  the  positive 
real  numbers. 

Since the LLSCFP allows six card columns to identify the names of the
variables, alphanumeric representations consisting of six characters
or fewer are used in computer printouts. Using X8 as an example, the
transformed independent variables are of the form

$$\mathrm{X8M} = (X_8 - \overline{X}_8),$$

$$\mathrm{X8DSQ} = (X_8 - d_8)^2,$$

and

$$\mathrm{LNX8} = \ln X_8,$$

where the bar indicates the mean, $d_8$ is the d-statistic of variable
X8 and ln is the natural logarithm. Table 3 shows the d-statistic and
mean for each of the variables used in the regressions.
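
The three transformations can be sketched as follows. This section does not define how the d-statistic is computed, so the formula in `d_statistic` is an assumption: it takes d to be the value for which $(x - d)^2$ is uncorrelated with $x$ in the sample, which is one common way of choosing the centering constant for a squared term. The function names and the sample values are likewise hypothetical.

```python
# Sketch of the three transformations for one variable. The formula used for
# the d-statistic is an assumption (see the lead-in), not taken from the report.
from math import log


def d_statistic(xs):
    """Value d for which (x - d)**2 is uncorrelated with x in the sample."""
    n = len(xs)
    mean = sum(xs) / n
    u = [x - mean for x in xs]
    return mean + sum(v ** 3 for v in u) / (2 * sum(v ** 2 for v in u))


def transforms(xs):
    """Return the mean-centered, d-squared and natural-log columns."""
    mean = sum(xs) / len(xs)
    d = d_statistic(xs)
    xm = [x - mean for x in xs]
    xdsq = [(x - d) ** 2 for x in xs]
    lnx = [log(x) for x in xs]      # requires strictly positive values
    return xm, xdsq, lnx


x8 = [1.0, 2.0, 4.0, 8.0, 16.0]     # invented values, not the study data
x8m, x8dsq, lnx8 = transforms(x8)
d8 = d_statistic(x8)
print(round(d8, 3))
```

Under this assumed definition the squared term carries curvature information that is not already explained by the linear term, which is consistent with the d values in Table 3 differing from the means for skewed variables.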

Before beginning any regressions, the data must be critically analyzed
for outliers (impossible values) and for what is known as "Nested
Data." The data are said to be "Nested" if some of the observations
have all or nearly all the same or approximately the same $x_i$-values.
Obviously outliers would have a significant impact on the fitted
coefficients, thereby yielding incorrect relationships. If the
equations are fitted without checking whether or not the data are
nested, the wrong factors may appear significant. The analysis of
"Nested Data" was first introduced into the statistical literature by
Daniel and Wood ([1], Chapter 8).
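
A simple screen for nesting, in the spirit of sorting the data by each independent variable, might look like the sketch below. The function name `tied_groups` and the data are hypothetical, and the sketch only reports exact ties on a single variable; detecting "nearly the same" values across several variables would need a tolerance and is not attempted here.

```python
# Sketch: list groups of observations that share the same value of a variable,
# after sorting by that variable. Names and data are illustrative only.
from collections import defaultdict


def tied_groups(obs_numbers, values):
    """Map each repeated value to the sorted observation numbers that share it."""
    groups = defaultdict(list)
    for obs, value in sorted(zip(obs_numbers, values), key=lambda p: p[1]):
        groups[value].append(obs)
    return {v: sorted(o) for v, o in groups.items() if len(o) > 1}


obs = [1, 2, 3, 4, 5]
unit_price = [900.0, 250.0, 1200.0, 250.0, 100.0]   # invented values
ties = tied_groups(obs, unit_price)
print(ties)
```

Running the same check on every independent variable and comparing the resulting groups gives a first indication of whether any observations coincide on most of their x-values.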

Since  there  is  human  intervention  in  the  development,  collection  and 
analysis  of  data,  outliers  might  not  immediately  become  apparent. 
TABLE 3

D-Statistic and Mean

                 MMH/OH                         MTBF
  I      MEAN(I)      DSTAT(I)          MEAN(I)      DSTAT(I)
  8    27173.650    133726.480        27536.100    133828.600
  9     1430.346      3307.065         1483.726      3337.016
 10       34.991        64.404           36.221        65.0?8
 11      909.089      2948.044          940.816      2965.510
 12        0.924         2.489            0.924         2.489
 13        7.556        43.028            7.677        43.038
 14       62.683        49.183           61.677        48.615
 15       16.000        46.370           16.258        46.403
 16        3.429        49.827            3.484        49.829
 17       10.492        40.395           11.065        40.258
 18       61.297        51.914           61.044        52.254
 19      360.9??       723.684          382.419       729.178
 20        1.657         1.681            1.645         1.679
 21        4.825        26.957            4.694        27.148

                 MTBMA                          LSC/OH
  I      MEAN(I)      DSTAT(I)          MEAN(I)      DSTAT(I)
  8    26943.220    133606.300        27536.188    133828.600
  9     1457.984      3311.538         1483.726      3337.016
 10       35.578        64.314           36.221        65.8?8
 11      927.746      2956.742          940.016      2965.510
 12        0.933         2.501            0.924         2.489
 13        7.556        43.02?            7.677        43.036
 14       63.349        48.895           61.677        48.615
 15       14.937        46.991           15.258        46.403
 16        3.429        49.897            3.484        49.829
 17       10.889        48.172           11.065        40.258
 18       61.138        51.898           61.844        52.254
 19      374.159       722.249          382.419       729.178
 20        1.639         1.681            1.645         1.679
 21        4.556        27.288            4.694        27.146
TABLE 3

D-Statistic and Mean (Con't.)

                 TRAIN/OH                       NRTS
  I      MEAN(I)      DSTAT(I)          MEAN(I)      DSTAT(I)
  8    26950.100    133922.500        27288.080    133763.300
  9     1445.145      3305.984         1475.339      3321.284
 10       34.955        64.418           36.005        64.594
 11      889.468      2982.834          936.758      2961.209
 12        0.904         2.6??            0.938         2.5?5
 13        7.677        43.038            7.677        43.036
 14       62.518        49.242           63.468        49.468
 15       15.823        46.514           14.468        45.651
 16        3.484        49.828            3.484        49.828
 17       10.661        40.474           11.065        40.258
 18       61.108        51.985           62.108        52.151
 19      372.097       723.719          378.529       723.518
 20        1.646         1.680            1.645         1.679
 21        4.839        26.952            4.654        27.149

However, the computerized plots of an initial regression may point
them out. Figures 13-17 show summary computer printouts of an
initial run (Pass 03) after fitting the equation

$$y = b_0 + \sum_{i=1}^{21} b_i x_i$$

where $y = \mathrm{MTBMA}$ and $x_i = X_i$, $i = 1, 2, \ldots, 21$, are
the 21 independent variables. The statistics (Figure 13) are not
significant, and the stars, *, indicate that the error is too large to
be printed in the space provided. The cumulative standard deviation
table (Figure 15) shows that there is a serious lack of fit. The
cumulative distribution plot (Figure 16) and the fitted value plot
(Figure 17) indicate that there is a single oversized residual
(possibly an outlier). Since there is also the possibility of
curvature, a fit is made with $y = \ln Y_3 = \ln \mathrm{MTBMA}$, with
printouts shown in Figures 18-22. The statistics using the natural
logarithm are greatly improved, but Figures 21 and 22 also indicate
(although less markedly than before) that an oversized residual
is present.
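
The idea of spotting an oversized residual by ordering the observations by residual size can be sketched with a one-variable fit. The data below are invented, with observation 7 deliberately made an outlier, and `fit_line` is a helper written for this example; the study itself fits 21 independent variables with the LLSCFP.

```python
# Sketch: fit a straight line by least squares and rank observations by the
# size of their residuals. Data are invented; observation 7 is made an outlier.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


xs = [float(i) for i in range(1, 11)]
ys = [1.0 + 2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
ys[6] = 100.0                                   # oversized value at observation 7

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
# Observation numbers (1-based) ordered from largest to smallest |residual|.
ordered = sorted(range(1, len(xs) + 1), key=lambda k: -abs(residuals[k - 1]))
print(ordered[0])
```

The observation at the head of the ordered list is the first candidate to check against the source data system, as was done for observation 36 in the study.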

In Figures 14 and 19, under the heading "ORDERED BY RESIDUALS", we
immediately see that observation 36 with WUC 7171A is the culprit.
In addition, Figure 13 shows that the maximum value of the dependent
variable, MTBMA, is 31,490 hours, and hence there is an observation
(piece of equipment) in the data base with such a large MTBMA. On
investigating the data collection system (AFM 66-1 "6-LOG") from which
the MTBMAs were extracted, it was found that observation 36 with
MTBMA = 31,490 was a subassembly of an LRU (called an SRU) and not
an LRU. Since the study is based on LRUs, leaving this particular
observation in would bias the results, and therefore another
observation with WUC 71716 was used in its place.

In addition to impossible values and discrepancies in the data
systems, outliers may be caused by simple keypunch errors. If those
errors are large, the computerized plots should reflect the
discrepancies. In many of the regressions considered throughout this
report, the plots were helpful in detecting those observations which
had not been in the Air Force inventory long enough to accumulate
"good" data. These observations were dropped from the analysis in
order to reduce bias.

[Plots: cumulative distribution of residuals (cumulative frequency,
normal grid); residual vs. fitted Y, dependent variable LNY3]

FIGURE 22

As a means of determining whether the data are "nested", a computer
program was written to sort the data table by each independent
variable, $x_i$. For instance, Table 4 is a printout of the data ranked
by unit price. As can be seen, there are only a few observations
with the same $x_i$ values, and hence no evidence of serious nesting
exists. If it is determined that the data are nested, two fittings
must be made: one on the nested data "within plots" and one
"among plots" (see [1], Chapter 8).

After the data have been critically examined, and alternate LRUs
considered where necessary, we are ready to begin the regressions.
As before, a fit is made (Pass 3) where Y3 is the dependent variable
(see Figures 23-27). Note that the statistics in Figure 23 are
greatly improved over those of Figure 13, which indicates the
power of an outlier in the data. The fitted values plot (Figure
27), however, shows a strange trend in the residuals that the
natural logarithm function can possibly straighten out. Figures 28-32
show the results of fitting the same independent variables but with
ln Y3 as the dependent variable. Here the statistics are only slightly
better than those of Figure 23, but the cumulative distribution plot
shows an approximately straight line and the trend in the fitted
values plot has disappeared.

The anxious analyst might feel at this point that the natural
logarithm transformation of the dependent variable is the
appropriate form to use in the equation. It might be, however, that
other transformations (curvature) in the independent variables are
needed in the equation before the correct form of the dependent
variable can be determined. Consequently, both forms of the dependent
variable should be fitted in parallel, analyzing the statistics, plots
and tables at each stage and introducing curvature whenever the
statistics indicate. In developing several of the fitted equations,
such as MMH/OH, LSC/OH and TRAIN/OH, the statistics did not definitely
point out which form of the dependent variable was appropriate
until the final iterations were made.


After much analysis in the MTBMA fit, it was discovered that ln MTBMA
is the more appropriate form of the dependent variable to be used. We
will therefore proceed with a sketch of the ln MTBMA fit. Figures
33-42 are component-plus-residual plots for the ten independent
variables X8, X9, X10, X11, X12, X16, X17, X19, X20 and X21. Figure
33 shows that there are 5 observations which extend the range
of X8 by 300%. It must be determined whether these observations
behave like the rest of the data (thereby extending the range of
variable X8) or whether they are not consistent with the remainder
of the data (possibly indicating curvature). Table 5 shows the ten
independent variables together with the observations which extend
their respective ranges. These observation numbers can be obtained by
using the component-plus-residual plots and the tables of data ranked
by each independent variable.


Indicator variables in conjunction with the Cp-search technique
can be used to determine the effects of such extended observations
(see [4]). The approach is to define indicator variables, X22
through X31, denoting those observations which extend the variables
shown in Table 5, then multiply these indicator variables by
their respective independent variables, and use the Cp-search
technique to determine if any of these interactions are significant.
If any of these products prove to be significant, then curvature
will be introduced in the variables, since we have assumed that the
7 initial indicator variables sufficiently describe all qualitative
information about the observations. This is displayed in Figure
43, where for instance X22 is an indicator variable indicating
observations 49, 31, 46, 47 and 34 and X22X8 is the product of X22
and X8.
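
Constructing such an indicator and its product with the corresponding variable is straightforward, as the following sketch shows. The observation numbers for X22 are the ones quoted above; the unit-price values, the short list of observations and the function name `extension_terms` are invented for the example.

```python
# Sketch: build an indicator for the observations that extend a variable's
# range, and its product with that variable. The x8 values are invented.

def extension_terms(obs_numbers, x_values, extending):
    """Return (indicator, indicator * x) for the listed observations."""
    indicator = [1 if n in extending else 0 for n in obs_numbers]
    product = [i * x for i, x in zip(indicator, x_values)]
    return indicator, product


obs = [30, 31, 34, 40]
x8 = [1200.0, 95000.0, 110000.0, 3000.0]
x22, x22x8 = extension_terms(obs, x8, extending={49, 31, 46, 47, 34})
print(x22, x22x8)
```

If the product column survives the variable search, the extending observations follow a different slope from the rest of the data, which is the signal used in the text to introduce curvature.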


There are 31 independent variables shown, of which only one belongs to
the "basic set," leaving 30 variables to be searched. The approach


4 "The  Use  of  Individual  Effects  and  Residuals  in  Fitting  Equations 
to Data," F. S. Wood, Technometrics, 15, No. 4, (1973), pp. 677-695.
[Component-plus-residual plots, Pass 5, dependent variable LNY3:
component with residuals vs. independent variable. C = component,
* = with residual]
TABLE 5

OBSERVATIONS THAT EXTEND THE RANGES OF THE VARIABLES

X8   -  31, 34, 46, 47
X9   -  18, 22, 28, 47
X10  -  22, 47, 49
X11  -  29, 31, 46
X12  -  68
X16  -  38, 48
X17  -  42, 47
X19  -  18, 28, 45, 46
X20  -  45, 54, 55
X21  -  46, 47, 48

used to search these variables is to utilize the fractional
replication technique twice, where 12 variables are searched at each
stage. First of all, the 12 variables numbered 3, 7, 9, 12, 16, 17,
18, 21, 23, 26, 27, 31 have the smallest t-values (see Figure 43). The
Cp-search technique indicated that none of the 12 variables searched
are significant enough to remain in the equations. These variables are
therefore dropped and the remaining variables are fitted (Figure 44).
Next, the twelve variables 1, 2, 3, 5, 6, 8, 9, 10, 12, 13, 16, 19
are put through the Cp-search technique. This time only variables
3, 6, 10, 12 and 13 are significant and the results are printed
in Figure 45.
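
The statistic behind such a search can be illustrated on a toy problem. The sketch below computes Mallows' $C_p = RSS_p / s^2 - (n - 2p)$ for every subset of two candidate variables, where $p$ counts the fitted coefficients including the constant and $s^2$ comes from the full equation. It is an exhaustive search over invented data, not the fractional replication search of the LLSCFP, and the helper names `ols_rss` and `cp_table` are assumptions made for the example.

```python
# Sketch: Mallows' Cp for every subset of a small candidate set.
from itertools import combinations


def ols_rss(columns, y):
    """Residual sum of squares of y on a constant plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations, solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    b = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
               for i in range(n))


def cp_table(candidates, y):
    """Return {subset of names: Cp}, with s^2 taken from the full equation."""
    n, k = len(y), len(candidates)
    names = list(candidates)
    s2 = ols_rss([candidates[v] for v in names], y) / (n - k - 1)
    table = {}
    for r in range(k + 1):
        for subset in combinations(names, r):
            p = r + 1                      # coefficients incl. the constant
            rss = ols_rss([candidates[v] for v in subset], y)
            table[subset] = rss / s2 - (n - 2 * p)
    return table


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
y = [3.0 + 2.0 * a + e for a, e in zip(x1, noise)]   # y depends on x1 only
table = cp_table({"x1": x1, "x2": x2}, y)
for subset, cp in sorted(table.items(), key=lambda item: item[1]):
    print(subset, round(cp, 2))
```

Subsets whose $C_p$ is small and close to $p$ are the candidates worth keeping; in this invented example any subset that omits x1 is penalized heavily.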

We note that the variables X22X8, X24X10, X28X17 and X29X19 remain
in the resulting equation. Therefore, curvature in the form of squares
and natural logarithms for variables X8, X10, X17 and X19 is
introduced into the regressions as shown in Figure 46. Since X17 is
the % transmitter, whose range is not contained in the positive real
numbers, no logarithm is used for it. In this pass, there are
two variables in the basic set. Since there are obviously some
variables of negligible influence (see the T-VALUE and REL. INF. X(I)),
a search must be made. Again using a double fractional factorial
search, the Cp-search technique admits only the 12 variables shown
in Figure 47 (Pass 39). Additional outputs of Pass 39 are shown
in Figures 48-51. Since the residual root mean square = .43, the
cumulative estimates of the standard deviation indicate that there
is little evidence of lack of fit. There are no observations far
from the "centroid" of all observations. The cumulative distribution
of residuals plot is now a straight line and the fitted y values
have a nice even scatter about the 0-residual line.


FIGURE 47

FIGURE 52

There is, however, one observation, 9, which has a larger residual
in absolute value than all the other observations (Figure 48). This
observation may be controlling the estimates of the coefficients. To
determine whether observation 9 is in fact controlling some of the
coefficients, a cross verification of coefficients is performed as
shown in Figure 52, where the statistics at the top are calculated
using the entire set of data and the statistics at the bottom are
those calculated with observation 9 omitted. Here we note that
the coefficient for X18 has decreased by nearly 50%, and the
coefficient for X21 has decreased by nearly 15%. Sometimes the effects
of an observation on a certain coefficient lie hidden beneath
other effects, and hence the effect on the coefficient for X21
may be larger than the statistics indicate. Therefore, in addition
to considering curvature for the previous variables X8, X10, X17 and
X19, we consider curvature for X18 and X21. If curvature is not
needed in variable X21, the Cp-search technique should omit it.
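
A cross verification of this kind amounts to refitting without the suspect observation and comparing coefficients. The sketch below does this for a one-variable line with invented data in which observation 6 is the suspect; the helper `fit_line` and all values are assumptions for illustration, not the study's Figure 52.

```python
# Sketch: compare least-squares coefficients with and without one observation.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0]     # invented; observation 6 is suspect

full = fit_line(xs, ys)
drop = 6                                   # observation number to omit (1-based)
kept = [(x, y) for k, (x, y) in enumerate(zip(xs, ys), start=1) if k != drop]
reduced = fit_line([x for x, _ in kept], [y for _, y in kept])

# Percent change of each coefficient when the observation is omitted.
pct_change = [100.0 * (r - f) / f for f, r in zip(full, reduced)]
print([round(c, 1) for c in pct_change])
```

A large percent change in a coefficient, as seen for X18 in the study, suggests that the omitted observation was controlling that estimate.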

This exercise is shown in Figure 53. Again, a fractional factorial
search, performed twice, yields the results in Figures 54-59.
The statistics are highly significant ($R^2 = .9183$ and F-VALUE =
41.5), the residuals are now evenly distributed (indicating constant
variance $\sigma^2(y)$), the cumulative distribution is a straight
line (indicating normality) and the cumulative estimates of the
standard deviation indicate that there is no evidence of lack of fit.
SECTION VII


CONCLUSIONS  AND  RECOMMENDATIONS 


The results (Summary Computer Printouts) of the five remaining
relationships obtained by the techniques of regression analysis are
shown in Appendix B. Table 6 gives a summary of these results,
indicating the transformation of the dependent variable, the multiple
correlation coefficient squared, the F-VALUE, the residual root mean
square, the standard deviation estimated from residuals of neighboring
observations, the Normal plot, the Fitted Y plot and the observations
with large WSSD. The four parameters MMH/OH, MTBF, MTBMA and LSC/OH
are the major drivers of O&M cost and were therefore given more
attention. Each equation was approached as the statistics, tables,
computerized plots and techniques directed. In the MTBMA example,
the WSSD was not used in determining the most appropriate form of the
equation, but this statistic, as well as other statistics, was
utilized in many of the other fits.

The MMH/OH, MTBF, MTBMA and TRAIN/OH equations all show significant
results with no indication of serious lack of fit. The LSC/OH
equation, although significant, had one observation, 4, which had the
largest residual in each of the intermediate fits, regardless of
the functional form considered. Observation 4, with WUC 71H60, is
a gyroscope platform (navigation equipment) used in the F4E.
Investigation into the data elements for this LRU did not indicate
errors in the collected data. Consequently, observation 4 remained in
the data base, since not enough is known about avionics equipment
and the form that the LSC/OH equation should take to deem it an
outlier. The residual root mean square of .52 indicates a slight lack
of fit as the cumulative standard deviation estimated from near
neighbors is .57. This lack of fit may be caused by the data and other
variables and transformations not considered in this study.

The NRTS equation was one of the more difficult equations fitted.
The statistics are barely significant and stability of the
coefficients was not attained. Among the many reasons for this is that
NRTS is highly dependent on many other factors not considered in
this study (most are subjective). Leaving out influential variables
can make other collections of variables appear significant when in
fact they are not. Although there is a serious lack of fit in the
NRTS equation (15.5 versus 7.1), the results are still useful, since
only large differences in NRTS cause significant changes in the
total number of spares estimated by the EBO routine.

Because of time constraints, one area not touched upon in this study
is that of "prediction intervals." The prediction intervals depend
on the standard error or variance of the fitted equation. The formula
for computing this variance is rather complicated and depends on
the residual mean square, the number of observations, the diagonal
elements of the inverse matrix and the spread of the independent
variables. Although the LLSCFP does not compute this variance, we
recommend the simple bounds suggested by Daniel and Wood
([1], page 55).
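
For orientation, the standard textbook form of this variance for a new observation at $x_0$ is given below. It is quoted as a general least-squares result, assuming $n$ observations, $p$ fitted coefficients (including $b_0$) and design matrix $X$; it is not the simplified bound of Daniel and Wood recommended above, and the symbols $x_0$, $X$, $s^2$ and $t$ are introduced here only for illustration.

```latex
% Estimated variance of the prediction error at a new point x_0
\widehat{\mathrm{Var}}(y_0 - \hat{y}_0)
    = s^2 \left[ 1 + x_0^{\mathsf{T}} (X^{\mathsf{T}} X)^{-1} x_0 \right],
\qquad
s^2 = \frac{\sum_{j=1}^{n} (y_j - \hat{y}_j)^2}{n - p}

% Corresponding two-sided prediction interval at level 1 - \alpha
\hat{y}_0 \;\pm\; t_{\alpha/2,\; n-p}\; s\,
    \sqrt{1 + x_0^{\mathsf{T}} (X^{\mathsf{T}} X)^{-1} x_0}
```

The quadratic form grows as $x_0$ moves away from the center of the data, which is how the spread of the independent variables enters the interval.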

Another area not discussed is that of error in the independent
variables (Assumption A3). Some notable contributions on the subject
of error in the independent variables have been made for the case of
one independent variable (see Bibliography: Acton, Hocking and Leslie,
Madansky). It appears, however, that there are no results now in the
statistical literature that lend themselves to practical application
when multiple variables are considered.

As previous experience indicates, many of the logistics, cost and
support parameters considered in this study are difficult to
estimate, especially Field MTBF, i.e., the value actually achieved.
MTBF is usually the major cost and risk driver in resource, warranty
and maintenance models. MTBF is estimated from MIL-STD-217B and
other reliability documents based on the proposed, detailed system
configuration. But the configuration and other parameters which
define MTBF are not usually well defined in the early proposal phase.
Previous predictions for Field MTBF in the conceptual phase were off
by several orders of magnitude, which indicates the risk involved
when using these predictions.

The estimating relationships obtained in this study were put through
critical statistical examinations and covered a wide range of possible
functional forms. Although the results, statistical and validation
(Volume I), were quite encouraging, there are still areas for
improvement that warrant further study to increase the prediction
capability of the equations obtained.

The first and major recommendation is to expand the data base. Although
the data base used indicated that relationships did exist, more data
would aid convergence to the "true" functional forms. By expanding
the data base we mean more data, more independent variables,
extended ranges of the variables and expansion to newer technology
areas. Other variables, not considered in this study, that
may have an influential effect on the dependent variable should be
introduced into the regressions, so as to reduce bias and improve
the prediction capability of the equations. Some of the variables
may not have been observed over a range adequate to display
their influence. Extending the ranges of the variables and using
newer equipment in the data base will enhance the capability of
predicting advanced equipment costs.

The second recommendation is to refine the data base. This includes
investigation into other Data Collection Systems to obtain more
sound and up-to-date data. Moreover, it is recommended that a panel
of qualified experts on the studied equipment be formed to determine
the validity of each data element.

The  third  recommendation  is  to  consider  more  transformation  of  the 
variables.  The  transformations  considered  covered  a wide  range  of 
possible  forms,  but  there  are  many  other  transformations  that  may 
better  approximate  the  more  complicated  cases.  For  instance,  some 
of  the  independent  variables  were  percentages  which  covered  a wide 
range of values. The Inverse Sine transformation can be used to
weight more heavily the small percentages, which have small variance.
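
As an illustration, the following minimal sketch applies the Inverse Sine transformation to a percentage. It assumes the common angular form, arcsin(sqrt(p/100)); the report does not state which form is intended. The function name and the sample percentages are hypothetical and are not data from this study.

```python
import math

def arcsine_sqrt(percent):
    """Angular (inverse sine) transform of a percentage in [0, 100].

    Assumed form: arcsin(sqrt(p / 100)), returned in radians.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie between 0 and 100")
    return math.asin(math.sqrt(percent / 100.0))

# Hypothetical percentages, chosen only to show the effect at the extremes.
percentages = [1.0, 5.0, 25.0, 46.0, 50.0, 75.0, 95.0, 99.0]
transformed = [arcsine_sqrt(p) for p in percentages]

for p, t in zip(percentages, transformed):
    print(f"{p:6.1f}%  ->  {t:.4f} rad")
```

On the transformed scale, a four-point gap near the low end (1% to 5%) is wider than the same gap near the middle (46% to 50%), which is the sense in which the small percentages receive more weight.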

In addition, cross products of the variables can be considered
as viable transformations. Again, there must be more data available
to  give  the  analyst  the  flexibility  needed  to  consider  many  different 
functional  forms. 

The fourth recommendation is to consider other subset collections of
the final collection of variables obtained for each equation. The
Cp-search technique, in addition to finding those collections of
variables which have the smallest total squared error, finds other
subsets of these variables and ranks them according to their Cp-values.
Sometimes these subcollections have approximately the same Cp-statistic
and variance of prediction as the final collection. Using such
subcollections would greatly enhance the flexibility of the equations
and the ALPOS model, in that values of the more difficult-to-obtain
variables may not be needed to make satisfactory predictions.
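
The ranking of subsets by Cp-value can be sketched as follows. This is a minimal illustration, not the Cp-search program used in this study. It assumes the standard Mallows definition, Cp = SSE_p / s^2 - (n - 2p), where p counts the fitted coefficients including the intercept and s^2 is the residual mean square of the full equation. All function names and the data are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def residual_ss(rows, y, cols):
    """Residual sum of squares for y fitted on an intercept plus the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, yi in zip(design, y))

def cp_table(rows, y):
    """Return (Cp, subset) pairs for every non-empty subset, smallest Cp first."""
    n, k = len(rows), len(rows[0])
    s2 = residual_ss(rows, y, tuple(range(k))) / (n - (k + 1))  # full-equation mean square
    table = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # coefficients including the intercept
            table.append((residual_ss(rows, y, cols) / s2 - (n - 2 * p), cols))
    return sorted(table)

# Hypothetical observations: three candidate variables and one dependent variable.
rows = [[1.0, 2.0, 5.0], [2.0, 1.0, 3.0], [3.0, 4.0, 8.0], [4.0, 3.0, 1.0],
        [5.0, 6.0, 7.0], [6.0, 5.0, 2.0], [7.0, 8.0, 6.0], [8.0, 7.0, 4.0]]
y = [3.1, 4.2, 6.8, 8.1, 11.2, 11.9, 15.1, 15.8]

table = cp_table(rows, y)
for cp, cols in table:
    print(f"variables {cols}: Cp = {cp:.3f}")
```

Under this definition the full equation always has Cp equal to its own number of coefficients, so the interest lies in smaller subsets whose Cp-value is close to their p.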

The  fifth  recommendation  is  to  investigate  the  possibilities  of 
considering  Non-linear  Regression  Analysis  as  a means  of  determining 
the correct functional form of the equations. Although the relationships
considered in this study covered a wide range of possible
functional forms, Non-linear Regression Analysis can be used to
approximate  even  more  complicated  cases. 

Although many Logisticians feel that predicting Logistics costs by
the techniques of Regression Analysis is not a viable approach
(usually because of inconsistencies in the data collection
systems), the statistics and validation results indicate
the great possibilities ahead. We feel that this study has significantly
contributed to the art of estimating advanced avionics
equipment  costs  early  in  the  conceptual  phase.  The  stage  has  been 
set,  the  statistics  defined,  the  approach  outlined,  the  results 
displayed  and  the  recommendations  made.  There  is  a light  shining 
across  the  horizon.  The  challenge  is  to  reach  that  light. 
SECTION  VIII 


A TECHNICAL  NOTE  ON  REGRESSION  ANALYSIS 

The  name  Regression  Analysis  is  associated  with  the  names  Analysis  of 
Variance and Analysis of Covariance. There is no sharp distinction
among these three types of analyses. According to
Scheffe (see Bibliography), the distinction can be made that in
the  analysis  of  variance  all  independent  variables  are  qualitative, 
in  regression  analysis  all  independent  variables  are  quantitative, 
whereas  in  the  Analysis  of  Covariance  the  independent  variables  are 
both qualitative and quantitative. According to this slight difference,
we might say that the analyses presented in this report
fall  under  the  realm  of  analysis  of  covariance.  Regression  analysis, 
however,  can  be  used  to  consider  all  three  types  of  problems. 

Analysis  of  variance  is  used  to  determine  if  significant  differences 
exist  between  the  means  of  different  populations.  For  instance,  we 
may  want  to  know  if  there  is  a significant  difference  between  the 
average  MTBF  of  equipment  used  in  fighters  and  that  of  equipment  used 
in  bomber  or  cargo  type  aircraft.  If  the  statistics  indicate  that 
a significant  difference  exists,  the  problem  is  then  to  find  which 
aircraft type is "more" significantly different from the "baseline."
Analysis  of  variance  depends  on  different  methods  of  comparison  to 
determine  the  "least"  and/or  "most"  significant  differences.  Two 
notable procedures that have been developed are Tukey's Method and
Scheffe's Method for Multiple Comparisons (see Guenther). Some recent
approaches have been the Least Significant Difference and Duncan's
Multiple Range Test (see Adler and Roessler).

The  problem  of  finding  the  "most"  significant  differences  becomes  more 
difficult  to  disentangle  when  more  categories  are  used  (such  as  the 
avionics  area)  and  interactions  are  considered.  However,  if  an 
analysis  of  variance  problem  is  solved  using  the  statistics  and 
techniques  presented  on  regression  analysis,  much  more  information  can 
be  obtained.  The  procedure  is  to  consider  indicator  (independent) 
variables  to  represent  the  qualitative  classes  (with  an  assumed  base- 


are not significant (i.e., a bad fit), we can say that there are no
significant differences in the means. If the statistics indicate a
significant  fit,  then  there  are  some  classes  or  interactions  with 
significantly  different  means.  If  interactions  are  not  among  the 
variables admitted by the Cp-search technique (which usually admits those
variables with the largest t-values and relative influence), then the
variable  which  causes  the  most  significant  drop  in  the  Cp-values 
gives  a good  indication  of  the  class  which  is  "most"  different.  If 
interactions prove significant, another fit is made with different
indicator variables representing the interactive classes
(with an assumed interactive baseline), and the statistics and
Cp-search technique are used to determine which specific interactive classes
are significantly different from which others. In addition, useful
information can be extracted from other statistics, tables and computerized
plots, such as the coefficients of the indicator variables,
the  "Component  Effects"  Table  and  the  component-plus-residual  plots. 
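
The use of indicator variables with an assumed baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in this study. The MTBF figures are hypothetical, the fighter class is arbitrarily taken as the baseline, and the helper names are invented for the example. With this coding the fitted intercept is the baseline mean and each indicator coefficient is the difference between that class mean and the baseline mean.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Hypothetical MTBF observations (hours) by aircraft type; not data from this study.
mtbf = {
    "fighter": [110.0, 95.0, 120.0, 105.0],
    "bomber": [150.0, 160.0, 140.0, 155.0],
    "cargo": [200.0, 190.0, 210.0, 205.0],
}
baseline = "fighter"  # assumed baseline class
others = [name for name in mtbf if name != baseline]

# Design matrix: an intercept column plus one 0/1 indicator per non-baseline class.
design, y = [], []
for name, values in mtbf.items():
    for value in values:
        design.append([1.0] + [1.0 if name == other else 0.0 for other in others])
        y.append(value)

k = len(design[0])
xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
beta = solve(xtx, xty)

# Overall F-statistic for the fit, which tests equality of the class means.
fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
mean_y = sum(y) / len(y)
sst = sum((yi - mean_y) ** 2 for yi in y)
f_stat = ((sst - sse) / (k - 1)) / (sse / (len(y) - k))

print(f"baseline ({baseline}) mean: {beta[0]:.2f}")
for name, coef in zip(others, beta[1:]):
    print(f"{name} minus {baseline}: {coef:.2f}")
print(f"F-statistic: {f_stat:.2f}")
```

The overall F-statistic answers the analysis of variance question, while the individual coefficients show directly how far each class lies from the baseline.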

There  are  also  elementary  statistical  hypothesis  type  problems,  such 
as  testing  the  hypothesis  that  the  mean  of  a certain  sample  is  equal 
to  a specified  value  against  the  alternative  that  it  is  not,  that  can 
be  handled  by  the  techniques  of  regression  analysis,  if  the  variables 
are properly defined (again, more information can be obtained).
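
One way to define the variables for such a problem is sketched below; this is an assumed formulation, not one taken from this study. The specified value is subtracted from each observation and the differences are fitted with an intercept only. The t-value of that intercept then tests the hypothesis that the mean equals the specified value. The function name and the sample are hypothetical.

```python
import math

def intercept_only_t(y, mu0):
    """Fit (y - mu0) on a constant term and return (estimate, std. error, t-value)."""
    n = len(y)
    z = [yi - mu0 for yi in y]
    b0 = sum(z) / n                          # least-squares estimate of the intercept
    sse = sum((zi - b0) ** 2 for zi in z)    # residual sum of squares
    s2 = sse / (n - 1)                       # residual mean square, n - 1 degrees of freedom
    se = math.sqrt(s2 / n)                   # standard error of the intercept
    return b0, se, b0 / se

# Hypothetical sample, tested against a specified mean of 100.
sample = [102.0, 98.0, 105.0, 110.0, 95.0, 101.0]
b0, se, t_value = intercept_only_t(sample, 100.0)
print(f"estimate = {b0:.4f}, std. error = {se:.4f}, t = {t_value:.4f}")
```

The t-value is compared with the t-distribution on n - 1 degrees of freedom, exactly as in the elementary one-sample test.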

Thus,  we  see  the  wide  range  of  possible  applications  of  Regression 
Analysis.

Many  examples  in  standard  statistics  books  as  well  as  some  advanced 
books  on  statistical  hypothesis  testing,  analysis  of  variance  and 
analysis of covariance, have been investigated. Using the approaches
outlined above, the results have been similar. The author is planning
a paper in the near future showing the results of these investigations.